TensorFlow Full Course | Learn TensorFlow in 3 Hours | TensorFlow Tutorial For Beginners | Edureka

Hi everyone, welcome to this interesting session, the Deep Learning with TensorFlow full course. Before we start the session, let’s take a look at the topics for today. We’ll start by understanding what artificial intelligence is, what machine learning is, and the different types of machine learning. Once you understand the different types of machine learning, we’ll look into the limitations of machine learning and how deep learning solves them, and then we’ll look into the real-life use cases of deep learning. Once we go through the use cases, we’ll look into how deep learning works, how a neural network works, and the neural network components, and then finally get into what TensorFlow is. Once you understand what TensorFlow is, we’ll also look into how TensorFlow works and the TensorFlow functionalities, and then look into what a perceptron is. Once you understand what a perceptron is, we’ll end the session with perceptron examples. Today we have a special guest, Amit, who is going to take the session forward. So, over to you, Amit.

Hello everyone. Welcome to the course, which is Deep Learning using TensorFlow. I would like to give a small introduction about myself. My name is Amit. I have around 12 years of experience in the field of machine learning, data analytics and deep learning, and for the last five years I have been working extensively in the field of deep learning as an independent consultant. I’m a mathematics graduate, have done my Master’s in Applied Mathematics, and have done numerous certifications in the field of deep learning. I have worked on deep learning projects involving NLP, computer vision and unstructured data analysis for different industries and companies as an independent consultant. If you have not installed TensorFlow and Python, please go ahead and install them on your systems. That’s the first requirement. So let’s get started. So the first thing which comes
to our mind: there has been a lot of emphasis on this term called artificial intelligence. So let’s first try to understand what artificial intelligence is at a very high level, and why we may need it in the first place for solving a problem. Let’s try to understand with an example. A person goes to a doctor and wants to get checked whether he has diabetes or not. The doctor would say, okay, there are some tests which you need to get done, and based on the test results the doctor would have a look, and from his experience, from his studies and the previous patients he has seen, he would be able to evaluate the reports and say whether the patient has diabetes or not. If you just take a step back and think, I said the doctor has experience. So what do we mean by experience? The doctor has learnt what the characteristics are of somebody having diabetes. What if we could provide this experience in the form of data to a machine, and let the machine take this decision, whether a person has diabetes or not? So, in place of the experience which the doctor gained through his studies and his practice, what we are doing is taking the data of customers who have different reports and different parameters, things like the glucose count in blood, or the weight and height, all these parameters about a human being, and feeding it to a machine so that it learns what the characteristics are of a person who has diabetes. Let’s say we have 1 million customers’ data. We give it to a machine and let the machine, from its experience, which comes from the data, or historical data to be precise, do the same task which a doctor is doing. So, if you see from this example,
what an artificially intelligent machine is doing there is learning from the historical data and trying to do the same thing which an experienced and intelligent doctor was doing. So this is the kind of area or domain: activities which human beings would be doing, if we can make a machine intelligent enough to do the same task. Why should we create these artificial-intelligence-based machine systems? There may be a lot of points, but at a very high level, if we just discuss a few: human beings have limited computational power. We may be good in terms of classifying things. You can see your friend in a group photograph and easily say which one is your friend and who the others are; you can easily listen to a language and comprehend what a person is saying. But human beings are not very good at doing a lot of mathematics. If you try doing a good amount of mathematical computation in your head, it will probably not be very easy. And second, it’s not possible for human beings to do it continuously, 24 by 7, for 30 days. If we can make machines do such stuff, they would be able to do these computations very fast. And, as you know, we spent a good amount of time discussing GPUs, and I also mentioned that Google is talking about the TPU, the Tensor Processing Unit, which would hugely change the entire paradigm of computation, and machines would become more and more competitive with, or even better than, human beings in some of the fields. This is the formal definition, but if I loosely translate it, it’s basically that artificially intelligent machines are those machines which can do tasks which human beings can easily do: things like identifying what’s written, let’s say, on a license plate, or playing games. I’m sure some of you would have already heard that machines have defeated champions in Go and in chess. Now we have digital agents like Siri and others, which can understand what we want them to do and can take intelligent decisions from the text or from the voice itself. So the applications are huge. And basically these
are very high level, some of the fields where deep learning has made great inroads: game playing, expert systems, self-driving cars, robotics and natural language processing. I’m fond of natural language processing. I am currently working on a chatbot in one of my projects, so this is one area of interest for me. There may be different and new areas where we are implementing it. All of you know that every experience of human beings is getting digitalized: the kind of things you buy, the kind of things you watch, what your preferences are, who you like on Facebook and who you don’t, what kind of movies you like. All these things, in terms of reviews, are being captured online. Once your data is captured online, there are systems which can analyze this data. So we have this huge data generation, and now machines which can process it and make some intelligent decisions are available as well. That’s why you will see there has been a lot of emphasis in the last couple of years. A lot of new things are coming. Those of you who have been reading papers on different subjects and different architectures would know that most of these architectures are not old. It’s a very dynamic field. In fact, on a weekly basis you will be hearing about a new API or a new kind of architecture being developed by somebody. Most of the stuff we will be studying in these classes is not very old. Take convolutional neural networks and recurrent neural networks: some of their variants are as new as last year. If you follow TensorFlow closely, they introduced a library called the Object Detection API, which TensorFlow has made available for everybody. You would be able to see for yourself how this API works. There are five different options for selecting different deep learning architectures, or convolutional neural networks, for this API. It has been able to identify human beings and all the other ninety objects there with almost 99% accuracy in some cases, where even human beings would find it difficult to see and predict what the object is. These machines have gone even beyond human capability in terms of identifying objects. Now we have a fair, very high-level understanding. We haven’t gotten into the details of artificial intelligence, but basically, from a loose understanding, artificial intelligence is about machines making decisions or doing tasks which human beings were doing earlier, something like game playing, natural language and driving a car. Let’s understand how
this machine learning and deep learning are related to artificial intelligence. Basically, artificial intelligence is the bigger domain. If you look at robotics, there is the involvement of a lot of things: there are mechanical inputs, how a machine moves, some dynamic kind of agents, and all other things. Within the artificial intelligence domain, we have machine learning for solving business problems. You can think of it, on a very high level, as machine learning being a kind of brain for artificial agents. All the data we get is processed to make the machine learn whether something is one class or another. These kinds of decisions are taken, or learned, through machine learning algorithms. Within machine learning, deep learning is a further subset. For people who come from a machine learning background, algorithms like regression, logistic regression, decision trees, random forests or support vector machines are the algorithms you would have been working with till now, and I’m sure these are very powerful algorithms. Some of them are really helpful in solving a lot of business problems. But deep learning is a different kind of architecture which does similar tasks. When I say similar tasks, I mean either predicting some numerical value, let’s say what the sales would be next month for a company, or a classification task, whether somebody will pass an exam or not, or identifying some objects within an image, or translating from one language to another. So there are all those kinds of activities, and problems which machine learning algorithms were not able to solve till now, these algorithms are able to solve. When we go through the deep learning architectures, you will be able to see that machine learning algorithms like regression or logistic regression are still very much relevant in deep learning, and a lot of the building blocks of deep learning take intuition from these algorithms. This is one of the reasons that we have included some of the basic algorithms, like regression and logistic regression, in the class, so that we can explain some of the steps: how you prepare the data to make your model learn, and what kind of steps you need to take to prepare your data, which are common not only in machine learning algorithms but also in deep learning architectures. So if you look at this, let’s say we have
data on some flowers and different aspects of those flowers, like sepal length, sepal width, petal length and petal width. These are the features, and features can be information about a product or a customer or anything. Let’s say we’re talking about a customer; then the information can be what the age of the customer is, how much he earns, or how much he spends every month on the company’s products. These are some information points. Once we give this data, your machine would be able to learn from the data itself whether a flower belongs to species A or B or C, or whichever different kind.
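The idea of learning a flower’s species from its four measurements can be sketched roughly as follows. This is only an illustration under assumptions: it uses the iris data bundled with scikit-learn and a decision tree classifier, neither of which is named in the session at this point.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Bundled iris data: 150 flowers, 4 measurements each, 3 species.
iris = load_iris()
X, y = iris.data, iris.target

# The model works out its own decision rules from the examples.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# A new flower: sepal length, sepal width, petal length, petal width (cm).
new_flower = [[5.1, 3.5, 1.4, 0.2]]
predicted = iris.target_names[clf.predict(new_flower)[0]]

print(iris.feature_names)
print(predicted)
```

Nothing in this sketch spells out a rule such as "if petal length is below some value"; the classifier derives its thresholds from the labelled rows it is given.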
Basically, in earlier systems used for solving such problems, people used to hand-code these rules: if this feature is that much, then this is class A, and if this feature is less than that, then it’s class B, and so on. These were hand-coded rules. However, when we do machine learning, as I was saying, it learns from the data, and once it has learned from the data we can give it a new input. When I say new input, I mean the information regarding the different features of a flower, the same sepal length, width and all. Because your machine has already learned, it would be able to classify which class this particular flower belongs to. Given that your machine is now able to understand and learn from the data, we can solve multiple business problems with the help of this. So let’s take a very
small example, the same example which we took earlier when we started discussing artificial intelligence: whether a person has diabetes or not. I was mentioning that this kind of decision is taken by the doctor based on the reports he has got, and these reports have some numbers, like the number of times a patient had a particular kind of issue, what the glucose count is, what the BMI is and what the person’s age is. Based on these numbers the doctor was able to make this kind of decision. We can take the same analogy as where we were trying to predict which species a flower is. We can take information on patients, the different attributes or features of a report and the patient, and the machine would be able to learn from these data sets, and for a new customer or a new patient it would be able to classify whether the patient has diabetes or not. So there are two sections to it. The first is the information about the patient and the different characteristics of his health. So these columns, from number of times and glucose count through to age, are the information points about the patients, and the last column is the information on whether the patient has diabetes or not. This is the kind of problem where we have some information points which explain what the situation is, and the last column, the output information, is in some kind of classes. It is a specific type of machine learning problem, but as of now the characteristic is that we have some information about the patient, and the last column is telling me whether the patient has diabetes or not. So what your machine
basically learns is all those rules. In the example which I was quoting, in earlier cases people used to create these hand-coded rules to predict whether an event will happen or not. But in machine learning, the algorithm will learn from the historical data and see what the combinations are which decide whether a patient has diabetes or not. These combinations would be something of this type. It is only for illustration; these are not real numbers. It’s to illustrate that your machine learning algorithm is able to identify these rules based on the historical data. So, let’s say, after learning it has created these rules: is the glucose count less than 99.5? If yes, then go to the next rule. If no, the person does not have diabetes. Is the glucose count greater than 66.5? If yes, the person has diabetes, and if no, then there is a further drill-down of rules. All these combinations or rules are dynamic in nature. What I mean is that these rules would change if your data set changes. You can take the same model that does the work of deciding whether a patient has diabetes or not, and make it learn on a new data set, let’s say flower species. It would be able to learn the new rules by itself. So the intuition is that, like human beings learn from examples, your machine learning algorithms also learn from examples. Just to frame our problem statement: machine learning, we know, learns by experience from the data, from the historical events. There are three kinds of problems
which we may be interested in solving. The first one is called a supervised learning problem. A supervised learning problem occurs when you have some input variables and one output column. Take both the examples which we have discussed till now. One was on the flower species, where we are taking data on different features of a flower and then which species the flower belongs to. The last column is the dependent variable, or output variable, which we are trying to predict, and all the information variables are called input features or input variables. Input features and input variables are used interchangeably in different texts. Input and output: these are the two different sections of supervised learning. Why is it called supervised learning? If I need to explain it, it is that we have a column to guide the algorithm on whether it’s making the correct decisions or not. Let’s say your model says a person has diabetes, but the actual data says the person does not have diabetes. So you have some kind of correction mechanism within your data itself, which can help your model tune itself to make better predictions. In some texts this kind of output variable is also called a teacher variable. It’s guiding the algorithm to decide those rules which I just went through. Another type of machine learning
is called unsupervised learning. In unsupervised learning we have only the input features, and our objective is to identify the patterns within the data itself. Those of you who are working in the telecom domain or on marketing campaigns would be very familiar with segmentation analysis or cluster analysis, where our objective is to identify coherent groups within the larger population. We take the customers as they are, the whole population, and based on different parameters and variables we identify groups of customers or products, whichever the business problem is, which are similar in nature, so that we can either run a marketing campaign or develop new products for those specific groups. The final one is called reinforcement learning. Reinforcement learning is a kind of learning where the agent learns from the environment. It works on a kind of reward and penalty. You can think of a self-flying helicopter. You leave it in the environment, and it will decide, based on the wind speed and other parameters in the environment, how it should fly, and the reward is that the fuel should be spent efficiently and it should spend more time in the environment. So supervised learning, as I said,
that the objective here is that we have some input features, and the input features hold information about different aspects of a given problem or customer, and we have an output variable which explains whether the event happened or not, some kind of output variable. There are two types within supervised learning. One is called regression and the second is classification, and the differentiation happens only because of the type of the output column. If your output column is of numerical or continuous values, like numbers, that would be a regression kind of problem, at a very high intuition level. For example, you’re working on a problem where we need to predict how much the sales of your company would be, given the information on how much it is spending on marketing, how many employees are working and which month of the year it is. If you have this information, you are going to predict how many million dollars of sales the company would be doing. So for these kinds of problems, where your output variable is of numerical values, it’s a regression problem.
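As a rough sketch of the sales example, a regression model can be fitted on a numeric output column like this. All figures are made up for illustration, the variable names are hypothetical, and scikit-learn’s LinearRegression is an assumed choice of model rather than something the session prescribes here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly records: marketing spend, employee count, month of year.
rng = np.random.default_rng(0)
marketing = rng.uniform(10, 100, size=60)
employees = rng.integers(20, 200, size=60)
month = rng.integers(1, 13, size=60)
X = np.column_stack([marketing, employees, month])

# Made-up sales figures (continuous numbers) generated from a known formula,
# with no noise, so the fitted model should recover the formula.
sales = 5.0 + 0.8 * marketing + 0.1 * employees + 0.5 * month

model = LinearRegression().fit(X, sales)

# Predict sales for one new month of inputs.
next_month = np.array([[50.0, 100, 6]])
forecast = model.predict(next_month)[0]
print(round(forecast, 2))
```

The point is only that the output column holds continuous numbers, which is what makes this a regression problem rather than a classification one.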
On the other hand, if the dependent variable is of a categorical nature, or of discrete values, it signifies that it’s a classification problem. Given that it has categorical values, your objective is to put the different customers or products into different classes, which is why the name is classification problem. So, as I said, there are two kinds of supervised learning problems: one is regression and the other is classification. Let’s take a use case where we need to predict the housing price in a particular locality, and we have information about these houses on different parameters. These parameters are things like: what is the crime rate in that area, how old is the home, how far is it from the city? This is from the Boston Housing data. This one, if I’m not wrong, is the percentage of Black population, or some such variable; we have the description later on. And then there is the actual price of the home. All these features, from CRIM to LSTAT, are the information about the house. These are my input features, and the output feature is the price of the house, which is in thousands of dollars. Our objective is to fit an algorithm that learns from all the historical homes which were sold, based on all the features and the price each was sold for. Once the model is trained, it should be able to predict how much would be
the price for a given home. Let’s take an example. Let’s say one of you is interested in buying a home in the Boston area, and you would like to know the ballpark figure for a two-bedroom flat of some square feet which is, let’s say, 20 miles from a specific location. What should the price be? One way would be to go and talk to people and try to understand what the average price has been. Or, if you have this kind of algorithm available, it can help you understand that, given these features, this was the price. If you can create a regression model, it would be able to tell you, given some features of the new home which you are interested in buying, what its price should be. As I said, there are two sections. One is the independent variables, all this information about the home, and the last variable is the dependent variable, which is the information about what the price was. What you see here is a scatter plot between the distance from the city and the price of the home, keeping all other variables constant. We are not looking at the influence of other features; we are just looking to model, or find, a relationship between the distance from the city and the price. From this graph you can make out that the further from the city a house is, the lower the price would be, if we keep all other things constant. If we can identify this kind of relationship, it’s called linear regression, though maybe the line here is not very linear. For a given distance from the city, you would be able to predict what the ballpark figure for a house should be, if you just have this information and not all the other information which we have talked about. This is only a relationship between one of the variables, distance to the city, and the price of the home. In a similar fashion, we would be able to find the relationship between the price and all the other features which we discussed, like how old the home is, how big it is, what the crime rate in the area is, and so on. So this kind of model is called
a regression model, and it’s the very basic equation of a straight line, y = a + bx. Here y is called the dependent variable, a is the intercept, b is called the slope and x is called the independent variable. If we go deeper and try to understand what it’s doing, this equation is telling me that if I already know the relationship, if I know the values of a and b from my historical data, which is about different homes, then given the value of x I should be able to calculate y. In our particular example, y is the price of the home. Let me try to give an example of what slope means. Slope is the change in the dependent variable if we change x, the independent variable, by one unit. So if I change x by 1 unit, how much change happens in y? That helps me understand what kind of relationship there is between x and y. And a is the value of y when x is 0. In the textbooks it used to be put like this: if you put the value of x equal to 0, whatever the value of y is, that’s the intercept. From an intuition perspective, you can think of the intercept as the value which is there even though you don’t have any information about x. For example, we were discussing the relationship between the distance from the city and the housing price. Even if the house is exactly in the city, there would be some value, and even if the house is a hundred miles from the city, there would still be some price.
So it helps us to intuitively understand the relationship. It also helps the line to understand where it starts, whether it starts from the origin or from some other place on your axes. This kind of equation is called the equation of linear regression because, if you see here, the power of x is 1, and that’s why it’s linear in nature. It is also simple in that we are fitting the relationship between y and only one of the variables.
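A small sketch may help make slope and intercept concrete. The points below are hypothetical and lie exactly on a straight line, and NumPy’s polyfit is used here as one convenient way to recover a and b; it is an assumption for illustration, not the method used later in the session.

```python
import numpy as np

# Hypothetical points lying exactly on the line y = 50 - 2x.
x = np.array([0.0, 5.0, 10.0, 15.0, 20.0])   # distance from the city
y = 50.0 - 2.0 * x                            # price

# Fit a degree-1 polynomial; polyfit returns the slope first, then the intercept.
b, a = np.polyfit(x, y, deg=1)

def predict(distance):
    """Apply y = a + b * x with the fitted intercept a and slope b."""
    return a + b * distance

print(a, b)
print(predict(1.0) - predict(0.0))   # change in y for a one-unit change in x
```

Here the intercept is the price at distance zero, and moving one unit further from the city changes the predicted price by exactly the slope.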
Then there is multiple regression, where instead of finding the relationship between only one variable and the dependent variable, we use several. In most practical scenarios the dependent variable y depends on more than one feature. For example, your house price is dependent on all these features, all these information points available, and what you want is that your regression model is able to identify the relationship given all the information together and then predict the price. The equation becomes y = a + b1*x1 + b2*x2 + b3*x3, and the coefficients b1, b2, b3 and so on help us understand the contribution of a given variable to your regression equation. So let’s have a look at how you can fit this model in Python. If you look at the first block of the code, we are saying import pandas as pd, import numpy as np and import matplotlib.pyplot as plt. This is the convention in Python for importing the libraries which we will be using. These are the libraries required for running this regression model. Once we import them, all the functions available in these libraries can be called very easily, and we will see how you can call them. Once you have imported these libraries, you see that we are loading the data called Boston, which is the same data set we have been discussing in the use case here. So in this next line of code we are importing the data and loading the Boston data set.
In the next line of code we are calling the pandas library, which we imported as pd, and then a function called DataFrame so that we can create a data frame in Python, and we are creating it from boston.data. So it will create bos as a data frame. In very loose terms, you can think of a data frame as a spreadsheet kind of format, where your data is put in rows and columns. You can think of an Excel file as a framework for the data frame; it is different, but this is just for intuition. After importing, I am calling .head(). What .head() does is give the top rows of the data set, five by default. All thirteen columns are there. In Python, the index starts from zero: you can see that the index runs 0, 1, 2, 3. So you can see what this data is.
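The data frame step can be sketched as below. One caveat: the Boston loader (`load_boston`) was removed from scikit-learn in version 1.2, so this sketch substitutes a stand-in array with the same shape (506 rows, 13 columns). The name `bos` and the placeholder values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Stand-in for the housing table: 506 rows and 13 numeric columns.
# The values are placeholders, not real housing data.
data = np.arange(506 * 13, dtype=float).reshape(506, 13)

# Wrap the array in a DataFrame; columns are labelled 0..12 by default.
bos = pd.DataFrame(data)

top = bos.head()          # head() returns the first 5 rows unless told otherwise
print(top.shape)
print(list(bos.columns[:4]))
```

Passing a number, as in `bos.head(10)`, is how you would ask for a different number of rows.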
This line of command, which says .columns, gives you all the features available in your data set. These are the different feature names, the different column names for the data, and then there is the price. At the end of the code I have actually written one line which can give you all the details: what the target variable is and the history of the data, where it was recorded, so you can easily look at this. Next we are calling boston.target and storing it as the price column of bos. So we are creating a column in our data frame, bos, from the Boston target, which is another data vector available in the Boston data set itself. And now specifically we are saying y is equal to this particular variable; y will represent our dependent variable. At this point bos has all the features plus the price. So with bos.drop on the price column and axis equal to 1, we have dropped the price variable from the overall data frame, and axis=1 specifies that we are removing a column. axis=0 represents row-level operations in pandas and axis=1 represents column-level operations. With the print statement we are just printing X, and this X is all the input features of our data set. So these are all the columns which we will be using
for predicting the housing price. As for how it will work: it’s not a neural model. What is actually happening is that, once we have created the model, it will fit a line going through the actual data set, something like this: price is equal to some intercept term, plus b1 multiplied by crime, plus b2 multiplied by another variable, plus b3 multiplied by another variable, and so on. The intercept and b1, b2 through bn are the coefficients which your model will learn from your already available data. Here we are showing the top five values of our housing price: y is the dependent variable, and we are looking at its top five values. So this was only a very brief and basic introduction to how you import the data and how you can see what the different columns are. It has nothing to do with machine learning; it is only for people who are new to Python, and for people who have been out of touch with Python and just want to brush up their skills. Now, if you have a look at this line of code which I’m highlighting, we are using the scikit-learn function for the train-test split. We already have X and y, and for both we are saying the test size is 0.33, so basically we are randomly selecting 33 percent of the data and putting it aside in the test bucket so that we can test on it later.
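A minimal sketch of this split, assuming placeholder arrays with the same dimensions as the housing data (506 rows, 13 features). The `test_size` of 0.33 and `random_state` of 5 are the values quoted in the session; the zero-filled arrays are stand-ins only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder inputs and target with the housing data's dimensions.
X = np.zeros((506, 13))
y = np.zeros(506)

# Hold out 33 percent of the rows for testing; fixing random_state
# makes the same rows get selected on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=5
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

The rows of X and y are split together, so each training row of features keeps its matching target value.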
And random_state equal to 5 matters because we are selecting randomly. If you specify a random state, then every time you run this code you will select the same set of elements from your data set. That helps you understand that, if you run the code multiple times with changes, the variation is not coming from the selection of the sample; it should be coming from the different model changes you are making. The .shape attribute tells you what the dimensions are. The first one, X_train.shape, gives me 339 and 13. Basically it’s telling me that there are 339 rows and thirteen columns in X_train. In X_test there are 167 rows and thirteen columns. And y_train is just 339 rows, and there’s just one column, or just one vector. The number of rows in X_train and y_train is the same, and in X_test and y_test the number of rows is the same, because they have been selected for the same combination, the same houses. We have selected the input features as well as the corresponding output values, and the same has been done for the test section. What we are doing here,
as I was saying, is that we have imported
a library called scikit-learn, and scikit-learn has different modules for different
machine learning algorithms. Linear regression is one
of the modules in scikit-learn, so we can create a linear regression object from
scikit-learn and call it lm. This equals sign is what Python
calls variable assignment, so we are assigning lm as
the linear regression model from scikit-learn. And now, if you look at this line,
lm.fit, basically we are saying: use the linear
regression model from scikit-learn and fit
the model between X_train and y_train. So basically we
are telling the model to learn
the coefficients for the different X values, given the y values
in the training data set. Basically, what the fit function
does is calculate the values of the intercept and b1, b2, b3 for all
the features in your input, for a given y variable. So once the
model has been fit,
we can use the same model for doing the predictions. So let me remove it. It should be like this. So with lm.fit we have fit
the model, and once it has been fit we
can use the learned model, which is lm, with a function
called predict on X_train. So what it will be doing is, once it has learned
those coefficients, the intercept and b1, b2, b3
for all the features in the input, you can use
the predict function for making the predictions
for your training data set. You already have the
actual values as y_train, and then you can compare how well your model is doing.
How can you compare it? If you look here, we have put them
together. We have done the same thing for the X_test data
set, lm.predict(X_test), and again the prediction
has been done. So if you see here, I have put together as
a data frame y_test and y_test_pred, and
the difference looks like this. So the predicted value was 37.6 and the actual value was
thirty-seven point something; then this is the predicted value and
this is the actual value, and so on. This difference between the actual value and
the predicted value signifies how much error
there is on your data set. Had it been that your model was giving
the same prediction as the actual value,
you would say there is no error, the model is
a hundred percent accurate, and all the predictions
made by the model are absolutely bang on. But normally that doesn't happen.
You end up having predictions which are a bit off
from the actual value, and we measure
that difference as one of the characteristics, or one
of the parameters, to identify how correct your model is.
For this there are two metrics
being used. The most basic one
is called mean squared error. What mean squared error is,
basically, is built on the difference between the actual value and the value predicted by the model.
What I mean by this is: this is your predicted value, 37.6,
and this is your actual value. What you do is take
the difference of these two and then take the square of it.
Why do we take the square? Because of the sign: for some values
the difference may be negative and for others positive, and if you sum them
up the differences may come close to 0, and you may end up thinking that the model is doing
really well. To avoid that, what we do is
take the square, so that every difference
between actual and predicted becomes positive and you
can sum them up to show how far your predicted
values are from the actual values. Then you take the mean
of it, to show what the mean squared difference is
between the actual values and the predicted values.
It can also be used for model comparisons. Here I can show you
how it works. Let's say you have
some actual values, something like this, and let's say
you fit a model and I fit a model. So there is
one model giving prediction one and another model giving prediction two. What you can do is
take the difference between the actual
and the predicted values. So for the first one the difference is 2, and then
you take the square of it, which is 4. For 23 and 21 the difference is again 2, and the square
of 2 is 4. Then for the third one the difference is 5
and the square is 25, and so on and so forth for all the values.
You get the total, the sum of squared errors, and you divide it by
the number of inputs, which is 5, and you get the mean
of the squared errors. You do the same thing
for the second one, and if you see, it is much lower: five. So it should be able
to help you understand which model is doing
a better job of predicting the housing prices or any other numerical variable.
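A small sketch of that comparison in plain Python is below. The numbers are invented for illustration and are not the values shown on the slide; the function name is just a label chosen here.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)

# Illustrative numbers only.
actual = [10, 21, 30, 42, 50]
model_1 = [12, 23, 25, 40, 53]   # differences: -2, -2, 5, 2, -3
model_2 = [11, 20, 31, 41, 52]   # differences: -1, 1, -1, 1, -2

# Without squaring, positive and negative errors of model_1 cancel out.
raw_sum = sum(a - p for a, p in zip(actual, model_1))
print(raw_sum)  # 0

mse_1 = mean_squared_error(actual, model_1)  # (4 + 4 + 25 + 4 + 9) / 5 = 9.2
mse_2 = mean_squared_error(actual, model_2)  # (1 + 1 + 1 + 1 + 4) / 5 = 1.6
print(mse_1, mse_2)
```

The raw differences of the first model sum to zero even though every prediction is off, which is exactly why the square is taken before averaging. The second model has the lower mean squared error, so it would be preferred on this data.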
And there is another statistic which is
used for identifying how good your model is, which is called mean
absolute percentage error. That's basically
the absolute difference between the actual value and the predicted
value, in absolute terms. You sum up those values
for all the entries and divide by the sum of
the absolute values of your actual values, and it
can help you understand, on average, by what percentage the predicted
values differ from the actual values. So if you see, it will
be somewhere in percentages. What I've done is I have
taken the absolute difference between the actual values and the predictions, summed it up, and divided by the sum
of my actual values. So whatever value comes
out, you can say, okay, it's five percent. So it will be fair to say that your model is 5%
off from the actual values, or the error term
in your model is, let's say, 5% or 6%, and for whatever predictions
you're making from the model, you can keep a buffer
of that percentage when you share them
with the team. What I mean is, let's say
your mean absolute percentage error is around 10% and it's
about the sales of a company. When you share
this forecast, you say that my predictions are
around 90% accurate, and the actual sales may be
plus or minus 10 percent. So this can help you communicate this kind of variability
in your predictions. You can implement this in Python code yourself
or use the scikit-learn library: you can call mean_squared_error,
the function from sklearn, and it can help you calculate
the MSE for a given model. So basically that was a very quick introduction
to linear regression. There are
different applications, but one thing remains common: we are trying to predict a dependent
variable whose nature, or the type of the dependent
variable, is numerical or continuous data. Some of the applications are predicting life expectancy
based on features like eating patterns,
medications, diseases, etc. You can predict housing prices; we have already
seen the example of that. We can predict the weight based on different features like
sex, weight, prior information about the parents and all. And you can also predict
the yield of a crop based on different parameters
like rainfall and all. And as I said, this is a very
limited list of use cases. I'm sure for people who are working
with a sales department, you have to predict how much
the sales would be. People who are working with call
centers, you need to predict what the number
of calls would be for next month. People who are working in
marketing, you need to predict what the footfall would be
in a given company or a mall. So there are
different applications of regression models, but one thing that is common
across all these applications is that the dependent variable which we are trying to forecast
is of a continuous data type. So let's get moving
to the next agenda item, logistic regression. At the time of the introduction
to machine learning we discussed that there are two kinds of
supervised learning techniques: one is regression and the other one is classification. The major differentiating
factor between the two was that in regression we had a dependent variable
of continuous values, and in classification problems we had a dependent variable
of categorical type. So let's take an example
of how we can do it. Here let's take a use case where we have got some
information about some customers, and the data set looks like this: we have a customer ID or user ID, the gender
of the customer or user, his or her age,
the estimated salary every month (you can think of it in any currency,
either INR or dollars), and whether this user
purchased an SUV or not. So as I was saying earlier,
the dependent variable here is 0 or 1, so it's a discrete value
or categorical value which we need to predict,
and the features which we will be using in the
model are age and estimated salary. We have already discussed
linear regression, so why would we need a logistic
regression kind of algorithm? It would seem
a straightforward process if I take purchased as
a numerical value, 0 and 1, and I take some input
features like age and estimated salary, and you would be right
in saying, to some extent, that this is a possible way
of doing it. So there are two
major problems coming if we follow this, and some of you can help
me with what the problems may be if I try using linear
regression for solving this kind of problem. One limitation I
can think of is that here I'm looking for an output which can give me some kind
of probability of how likely I am to buy
a product or service. The limitation,
or the restriction, with probability is that probability values should be between
0 and 1. 0 signifies that there is no probability, no likelihood
of the event happening, and 1 means that it's certain
that the event will happen. There is no possibility that we can have
probability values less than 0 or greater than 1. So if I'm fitting a linear model
taking the purchased column as my dependent
variable, then, because linear regression
has no such limitation, my predicted values can go beyond 1
or below zero. So what I require is that I fit the model
in a similar fashion to what I did in linear regression,
with the equation I used earlier; I want the information
to be coming from my features in a similar fashion. But what I actually want is that this y should be mapped
to values between 0 and 1. Given the limitation we have
just talked about, that a probability should be between 0
and 1 and should not go beyond 1 or below 0, I
need to find a way to kind of force-fit
this y value, which would be coming
from this equation, into values
between 0 and 1. So to solve this problem, there is a function called
the sigmoid activation function, which is extensively
used in deep learning as well, at different places. Logistic regression comes from this
activation function itself, which is a function that
looks something like this: the output value
is 1 / (1 + e^(-x)). Here x is not actually one of the input features; it is any value
we are giving it. If you feed any value
into this particular equation, it can take any value
between minus infinity and plus infinity and map it
to between 0 and 1. About the value of x
which you are putting in here: I could have selected
a different name, at least. If you give a highly
negative value, the output would be very, very close to 0. If this input is highly positive, then
the output would be close to 1. If the value of x,
the input here, becomes 0, what would be the output? 1 by 2, because any value to the power
0 is equal to 1, and 1 divided by 1 plus 1 is equal
to 1 by 2, or 0.5. So logistic regression is
nothing but an extension of your linear regression
itself, with one additional fact: you want to force-fit
your output between 0 and 1, and for that you
are using an activation function called the sigmoid activation
function, or sometimes it's also called the
logistic activation function, to do that task. So this is the intuition
behind logistic regression, where you take the value
of your equation from the intercept and the different
coefficients for your inputs, and you map these outputs
between 0 and 1. So we have understood that logistic regression is
nothing but an extension of your linear regression, only with the restriction of the
output being mapped between 0 and 1. If we
were fitting the plain regression equation, we would be having scenarios where the value would be going
beyond one or below 0, and to avoid that scenario we fit logistic
regression with the help of the sigmoid activation function, which looks like this.
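In code, the sigmoid can be sketched as below, using only the standard library. The intercept and coefficients in the last few lines are hypothetical values picked just to show a linear output being mapped to a probability; they are not learned from any data.

```python
import math

def sigmoid(x):
    """Map any real number to a value between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use an equivalent form that avoids overflow in exp(-x).
    z = math.exp(x)
    return z / (1.0 + z)

for x in (-10, -1, 0, 1, 10):
    print(x, round(sigmoid(x), 4))

# Hypothetical intercept and coefficients, only to show the mapping:
# the linear part can be any real number, the sigmoid squeezes it into (0, 1).
a, b1, b2 = -4.0, 0.08, 0.00002
age, salary = 45, 90000
linear_output = a + b1 * age + b2 * salary
probability = sigmoid(linear_output)
print(linear_output, probability)
```

Highly negative inputs come out near 0, highly positive inputs near 1, and an input of exactly 0 gives 0.5.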
And if you see, as I was saying, when the value from your model goes
beyond, let's say this is our zero, so for all the values which are positive
and greater than 0 the curve flattens out towards one,
and for all the values which are less than 0
it goes towards zero, and at 0
the probability is 0.5. So it's a 50% probability if your output is very
close to zero or is zero. So the sigmoid activation function
is used for that purpose. It can be used
for multiple scenarios. The example we
are taking is whether somebody will buy
an SUV or not. But you may be trying
to solve problems like whether somebody will say yes
or no to a product or service, or whether something is true
or false, or high or low, or any other categories. And logistic regression
can easily be extended to multi-class
classification problems as well. If I just give
you a very quick introduction to how it works there:
in multi-class classification it kind of maps one class versus the rest
of the other classes, and then the same analogy follows for which class a particular
event would be associated with. At the end of the day,
for whichever class or category the
probability is highest, the model will predict that the example belongs to
that particular class. As with MSE, we need a statistic
or a parameter to evaluate how well your model
is doing. For regression you had MSE, mean squared error, and that was a parameter
to check what the difference is
between the actual value and the predicted value.
How were we doing it? We were taking the actual value, subtracting
the predicted value, taking the square of it, doing it
for all the examples, and dividing by the number
of training examples we have. Then it gives
you some number, and I was also saying that this number is helpful
in comparing different models. So let's say you fit a model, I fit a model, and we compare
the MSE for both of them. Whichever model is giving
a lower MSE, it is an indication that that model is probably
doing a better job in terms of prediction.
Then, in a similar fashion, we need a kind
of statistic to see how well your model is doing when it is solving
a classification problem. So here there
are four categories. Let's say we have only two classes, good and bad:
the actual values are good and bad, and what your model
is predicting is good and bad. So take the examples
which belong to the good category and which your model is also
predicting as the good category. These types of events or examples are
called true positives, because your model
is making correct predictions on positive examples.
In another category, the actual value
for those examples is bad, they belong to the bad category, and your model is also
predicting them as bad. These are called true negatives, and these are correct
predictions because whatever the actual value is, your model is
predicting the same thing. However, there are
two more categories. Where your
actual value was bad but your model is predicting
good, these kinds of examples are called false positives, because your model
is falsely predicting that these are good examples. And another category, the last category,
is called false negatives, where the actual values were good and the model is predicting bad. So how do you judge? How do you say
which model is doing a good job? What we do is calculate the percentage
of examples that have been predicted correctly.
These sections in blue, true positives
and true negatives, are the examples which your model is able
to predict correctly, and these two groups, false positives and false negatives, are
the incorrect predictions. So what we do is
just take the percentage
of correct predictions, and this matrix is
called the confusion matrix. Let's say there are some examples, out of which
there are 65 examples where they actually were
good-category examples and your model is also
predicting them as good-category
examples. 44 are those which belong to the bad category and the model
is also predicting them as bad. And eight
are actually bad and predicted good,
and four are actually good and predicted bad. What you do is sum up all
the correct examples, 65 and 44, and divide by all the examples in your data set,
all the correct and incorrect ones, and here you get 89%. So you can say that your model is able
to predict with 89% accuracy. Or if you want to explain it
to your business team, you can say that whenever I give
you a prediction that a hundred customers
will churn, and I give you a list of a hundred customers,
I can say with some certainty that around 89 of them will churn,
because your model has given
you 89 percent accuracy. So that's how it is
communicated to business teams: if we think that our model is 99% accurate, then whatever prediction we are
giving, we are very, very certain. But if your model accuracy
is 70 or 60 percent, then when you give the predictions
to a business team you say, okay, we are giving
you the predictions, but we are not very certain whether they will work
correctly or not. So this accuracy percentage,
in a similar fashion to what we did for linear
regression with MSE to identify how good the predictions are,
is the metric to see how close or how correct
the predictions have been. So now we can see the implementation of
logistic regression in Python. In the first few lines, if you see, we are importing the libraries,
the machine learning libraries, which we require to do
the data manipulations. We are importing the data,
which is in CSV format, and this data is already
available on your LMS. If you want to import it,
you can easily get it from the LMS itself. Unlike the Boston data set, which we were importing
from the library itself, here we have got a flat file, Social_Network_Ads.csv, and you can call the
read_csv function of the pandas library
to import the data. So you are importing
the data as dataset, and as I said, head shows
the top five rows of your data. Here we have only
five columns: user ID, gender, age and salary, and the last column
is our dependent variable, which signifies whether a customer
or user bought the SUV or not. So it's 0 and 1, and 1 means the person bought it.
In the previous code we used one convention
for selecting X and y; here we are showing
another way of selecting, called iloc, where we are looking up by
location. If I go through
what we are doing here with this convention: this is the data set,
and within the data set we are specifying the locations.
This colon means that we want all the rows.
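The selection being walked through on screen can be sketched as follows. The file itself is not read here; a small hand-made data frame with the same five-column layout stands in for it, and the column names and values are assumptions.

```python
import pandas as pd

# A few stand-in rows in the layout described in the session:
# user ID, gender, age, salary, purchased. Names and values are assumed.
dataset = pd.DataFrame({
    "User ID": [1, 2, 3, 4, 5],
    "Gender": ["Male", "Male", "Female", "Female", "Male"],
    "Age": [19, 35, 26, 27, 19],
    "EstimatedSalary": [19000, 20000, 43000, 57000, 76000],
    "Purchased": [0, 0, 0, 1, 1],
})

# iloc selects by integer position: [rows, columns], counting from 0.
# A bare colon in the row slot means "all rows".
X = dataset.iloc[:, [2, 3]].values   # columns 2 and 3: Age, EstimatedSalary
y = dataset.iloc[:, 4].values        # column 4: Purchased

print(X.shape, y.shape)  # (5, 2) (5,)
```

With the real file, the data frame would come from `pd.read_csv` instead of being typed in, and the two `iloc` lines would stay the same.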
And as I was saying earlier, in Python
the index starts from 0. So what we are saying is
we want columns 2 and 3. What we mean is
that this is 0, this is 1, this is 2 and this is 3. So we want columns two
and three as our input features, and .values, so it will be creating an array of these two columns. We
could have used gender, but I will leave that
to you. First we would need to encode
gender as a vector of zeros and ones. You
can create a function which says, okay, if gender is equal
to male then one, else zero, or you can create dummy
variables; there are functions available in scikit-learn. So it's an exercise for you. This is the code
already available, but I would encourage you to also
include the gender information in your model. In the next line of code, y, we are saying
the dependent variable is all the rows and column number four. Column number four is
your purchase information, whether a customer bought
the SUV or not, and again .values to convert
it into an array format. So we have specified two things: the two columns
with the information about the input features,
all the information about the user in terms of
how much money they make and what their age is,
and the information for y, whether a customer
bought the SUV or not. In the next line we are doing
the train and test split, for the same reason: to evaluate whether the
model, having learned on the training data set,
is still doing the correct classification
on the test data set, which was not involved
at the time of training. This 0.25 means that we are selecting 25%
of the data for test and the remaining 75%
for train. So what is the correct split
of train and test? Normally it is fine to choose
something like 60/40 or 70/30 or 80/20. If your data set is big enough,
then I think having an 80/20 split is good. You can try
these different combinations, but as a rule of thumb, most of the time I have seen people
taking something like a 60/40 or 70/30 kind of distribution between train
and test. Now there is one important thing
for data pre-processing, and this section which I have made is for doing
the data pre-processing. Some of you who come from a machine learning
background will already know how important it is
to scale your data. And what do I mean
by scaling? If you look at the data set which we are using
for input, one is the age column and the second one is the income column. Age
can be somewhere between, let's say,
1 and 100, or 120 at max, and your income is in the thousands or
hundreds of thousands. These two values are on different
scales. Scaling your data so that, let's say, all the values are between zero and one
will help me understand the importance of each variable.
For example, if you look at a regression
equation and see the coefficients b1, b2 for all the input features, these coefficients
can give you an indication of how important
a particular variable is. But this intuition
will only be correct if all my features are
on the same scale. If features like age and salary are
on different scales, you will not be able
to compare what these coefficients really mean, because they come from two different
scales. So it is always a good idea
to have all your features on the same scale. There are multiple ways of doing
it, and there are multiple types of scaling. The simplest one is called
min-max standardization. What it means is
that for a given column, let's say we are talking
about the age column, which is 19, 35, 26 and so on, if I need
to do min-max standardization,
I take the value, let's say 19, and
the minimum value here is, let's say, 19; I have only these five values. So how it works is that this is the formula:
(x_i - min(age)) / (max(age) - min(age)), that is, the value minus the minimum value of the column,
divided by the maximum of age minus the minimum of age. Basically, what
this formula will do, if I do it for all the values
in the age column, is convert
all the values to between 0 and 1. And there are other ways. I also said that there is a
normalization process where you take the value minus the average value, divided
by the standard deviation, if I recall it correctly. Whichever method we
apply, all I'm saying is that these values of age and salary should be brought
to the same scale. So if I'm applying
this min-max standardization, I'll apply it to both my columns so that both these variables
are on the same scale and I am able
to use them in my model. And this is again a very important thing: whenever
you do standardization, you will be using this process, that you fit the normalization or standard scaler
on the training data set and you use the same learned
standardization on the test data set.
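A minimal sketch of that process in plain Python is below: the minimum and maximum are learned from the training column only and then reused for the test column. The helper names and the age values are made up for illustration.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from the training column."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Scale values with a previously learned minimum and maximum."""
    return [(v - lo) / (hi - lo) for v in values]

train_age = [19, 35, 26, 27, 19]   # illustrative training values
test_age = [19, 30, 40]            # illustrative test values

lo, hi = fit_min_max(train_age)                  # learned on train only
train_scaled = apply_min_max(train_age, lo, hi)  # all values fall in [0, 1]
test_scaled = apply_min_max(test_age, lo, hi)    # same scale as train

print(train_scaled)  # [0.0, 1.0, 0.4375, 0.5, 0.0]
print(test_scaled)   # [0.0, 0.6875, 1.3125]
```

Note that a test value larger than the training maximum (40 here) scales to something above 1. That is expected: the same age always maps to the same scaled number, which is what matters for the model. In scikit-learn the same pattern is available through scaler objects such as `MinMaxScaler` and `StandardScaler`, which are fit on the training data and then used to transform the test data.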
But basically how it helps is this: the data set on which
your model has been trained will have its values converted to
between zero and one based on its minimum
and maximum values. If the test data set
has different minimum and maximum values and is scaled separately, the same number can end up with a different scaled
value. So that's why the process is that we
fit our standardization on the training data set and use
the same minimum and maximum values for the test normalization as well. It gives the same scale
for all the values, and for model predictions it's very helpful. In a similar fashion, like
we called a linear regression object from scikit-learn
in the previous example, in exactly the same way we can call the logistic
regression class from scikit-learn. So it's scikit-learn's linear_model
module, and we are importing LogisticRegression. Now we are fitting
the logistic regression between X_train and y_train, the same way
we did earlier for the linear regression. And once it has been fit,
we can do the prediction for the test data set. We are doing
the same thing: we are calling predict on the object, which was named classifier
for the logistic regression, and we are doing it
on the X_test data set. Here the default
probability threshold is 0.5. So what your algorithm is doing in the back end is that for all
the examples in X_test where the probability
came out greater than 0.5, it was tagged as
one, and for all the examples where the probability was less
than or equal to 0.5, it was tagged as 0. And now we
are calling this function called confusion_matrix
between y_test and y_pred, so we are comparing what the values are
for true positives, true negatives, false positives
and false negatives. So these are the values: these are
the true positives, these are the true negatives, and these are
the misclassified values. And if you want
to calculate the accuracy, you can easily do it as (65 + 24) divided by (65
+ 24 + 8 + 3). So all we are doing is
trying to identify the percentage
of correctly predicted examples. And this is the code for the section.
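Putting the steps of this walkthrough together, a rough end-to-end sketch looks like the following. The data is synthetic (random ages and salaries with a made-up purchase rule) because the CSV file is not available here, and the `random_state` values are arbitrary choices, so the confusion matrix will not match the 65 / 24 / 8 / 3 from the session.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the age / salary / purchased data (not the real file).
rng = np.random.default_rng(0)
age = rng.integers(18, 60, size=400)
salary = rng.integers(15000, 150000, size=400)
score = 0.1 * (age - 38) + 0.00003 * (salary - 70000) + rng.normal(size=400)
purchased = (score > 0).astype(int)

X = np.column_stack([age, salary]).astype(float)
y = purchased

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the scaler on the training data only, then reuse it on the test data.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# predict() tags an example as 1 when its predicted probability exceeds 0.5.
proba = classifier.predict_proba(X_test)[:, 1]
y_pred = classifier.predict(X_test)

# Rows are the actual classes, columns the predicted classes.
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(cm)
print(accuracy)
```

The accuracy line is the same calculation as in the session: correct predictions on the diagonal of the matrix divided by all the test examples.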
So it can do the prediction. If you see what it has done, basically, if you
look at this section, your regression model
has fit this line, and you can see
the straight line. That is why in some
texts logistic regression is also called
a linear classifier. Why is it called
a linear classifier? Because it is
predominantly made for fitting a linear equation. The logistic regression equation
was y = a + b1*x1 + b2*x2 and so on, with all
these coefficients and their respective inputs, and the highest power
of your inputs was one. You would already know that a polynomial of power one stands
for a straight line. So that's why you
can see a straight line. Some
of you would argue that we can fit
a nonlinear boundary with the help of logistic regression, but you would also concede that there are some tricks
which we use for creating nonlinear boundaries
through logistic regression. For example, you introduce
higher-order polynomial terms into your model so that the separation
becomes nonlinear. These kinds of algorithms are really helpful
only for solving problems where the classes
are linearly separable. When the classes are
not linearly separable, these kinds of algorithms
are not very helpful, and we need to identify
algorithms which can fit a nonlinear hypothesis or
nonlinear separation boundaries between different classes. So let's take a use
case to understand what
the simple scenarios are where unsupervised
learning can be used and how it really works. Let's take an example where we have some housing data,
housing data in terms of where the homes are located. These white dots
on the screen, on the blue background, show where these homes
are located. The objective of an education officer is that he needs to find
a few locations where schools can be set up,
and the constraint is that students shouldn't have
to travel much. So with this constraint in mind, the officer needs
to decide the locations. Maybe we can identify them
easily even if we are not using any algorithm. Let's say I am the officer and I
need to open three schools in the locality, and I
know where the homes are located.
I can easily see, okay, probably this
is one location; I'm just highlighting it. And the constraint I
also mentioned was that students shouldn't have to travel
much. What I mean is that if you open the school here,
then every one of you would say that it's not a great location
for a school, given that it's far away
from the population. So this is not
the correct location. From the perspective of looking at the homes,
with human judgment, if some of you
had been given the task without any algorithm, you could decide that probably these
are the three locations where, if you set up the schools, most of the students would be
traveling less to go to school. So given this problem,
we can easily see that we don't have
a dependent variable as such which is telling us whether it's the correct
location or not. All we have is the number of locations we need to find for the schools,
and then we have the home locations, and based
on the distance of each home we need to identify
the proper locations of these schools. And another thing that is coming
from the same logic is that there are no predefined
classes for these locations. And one more point I would like
to add, for those of you who have done clustering or segmentation work
in your respective jobs: this number,
whether we say three or four or five, is not predefined. Most of the time it is
given by the business, how many clusters or segments they
would be looking for. There are statistical
ways of identifying what the best number
of clusters should be, but most of the time
it would be coming from somebody in the business. Let's say, I
was working for one of the Indian telecom companies
here quite some time back, and at that time
their subscriber base was around 300 million
customers. Imagine that you're trying
to create segments for this big a population. If you create
three or four clusters, you can easily understand that it would be very difficult
for a marketing team or any product team
to design products for such big groups. So though statistically
it may look like, okay, four or five unique
segments are there, you end up creating a lot
of small segments, and there may be a possibility that you will be creating
20 or 30 segments for such a big population. So my intent
in saying this is that the number, like when we are trying
to identify three locations within the population, has
to be decided either by the business or by people like you who have knowledge
about the data as well as about what kind
of business they are running and what the final usage
of this segmentation exercise is. So one way of doing the selection
of these school locations is like we have already done it, manually: somebody looked
at the homes, saw where the density is high, and
selected the locations by eye. But there are algorithms
also available to do the task, and I can give the name here itself: it's the
k-means algorithm. And so first we
would like to understand how the algorithm works if it needs to identify
which are the best locations. So if you are looking
at my screen, let's say our objective is that we need to identify
two locations first, and we have some data
and a scatter plot available. We need to identify
where the schools should be so that the distance
from the homes is minimal; that's our objective. So how can we do it? Let's say we
randomly pick two points from the existing data set.
Actually, the easiest way is that you randomly
pick two points from your data set itself. Then what you do is
assign these two selected points as
your cluster centroids, so each is the center
of its selected population. Then the second step: once you have initialized
these two random points, the next step is that you measure the distance
of all the homes from the initially selected points. So let's say you measure
the distance of this home from this selected point
and again from this one. For each house,
from these randomly initialized points, we
measure the distance between the initialized point and the home location and see
which distance is the minimum, which distance is less
in comparison to the other.
So we can easily see that this distance is smaller
than this distance, and this point would be assigned
to this particular group. So the first step, initialization, is to initialize as
many centroids as the clusters you need, and in the second step you
do the cluster assignment. In cluster assignment,
how it's done is that you measure the distance from these initialized
points, see wherever the distance is the minimum, and then assign this home to that particular
segment or cluster. So this exercise
has been done for each home and just trying to show
for a couple of them, and based on the distance the assignment is complete. So this color also signifies what we have done:
after measuring the distance for each home from the initial points, we
have assigned these points to the first cluster and these blue points
to the second cluster. And then once this
assignment is complete, the algorithm moves the centroid. So what it does is
take the center (the mean) of all the points assigned to a cluster, and then it
moves the centroid from the previous point to this new point, based
on the new assignment, which has already
been completed. And then what’s done is
the same exercise which was done earlier
in terms of cluster assignment that we measure the distance of each home from the centroid
so distance from this centroid and this centroid and wherever its minimum assign
it to that particular cluster and this process
has been repeated again for both the centroid and once the distance
is being measured on the improved
or changed centroid again, the assignment process
has been started. So once you have moved and then measure the distance
and then assignment also changes like it was done
in the previous step and we continue this process
till the time we have reached a point where this change in assignment
has stopped completely. So once we have reached
this kind of scenario, where however many times you measure
the distance from the centroid to the different points, your centroid does
not change, at this point we say that your model has converged,
and at that point you can say,
okay, this is one group of these points or these homes. So this is one cluster
and the second one is this cluster. So this is how k-means works. It has a wide
variety of applications. There is a function available
in the scikit-learn library; you can try implementing
it. The intent of showcasing this example
of unsupervised learning was that we will be covering
two algorithms which come from the unsupervised learning
section of machine learning, and these would be restricted Boltzmann
machines and autoencoders, which work on a similar
methodology of unsupervised learning. So, in a similar fashion, as we started discussing
in the beginning that where should be
the location of these schools, we can use the k-means
algorithm: initialize three points randomly, measure the distance
to each home, and assign the homes to the cluster wherever
the distance is minimum, and we continue this process
of measuring the distance and assigning to the cluster till the time these values
have converged. The most important task
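The steps described above (random initialization, cluster assignment, moving the centroid, repeating until the assignment stops changing) can be sketched in plain Python. This is only a minimal illustration under assumptions: the `homes` coordinates, the function names, and the fixed random seed are made up for the example, and in practice you would more likely use the ready-made implementation in scikit-learn.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def assign(points, centroids):
    """Cluster assignment: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def move(points, labels, centroids):
    """Move each centroid to the mean of the points currently assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:  # an empty cluster keeps its previous centroid
            new_centroids.append(old)
            continue
        mean_x = sum(x for x, _ in members) / len(members)
        mean_y = sum(y for _, y in members) / len(members)
        new_centroids.append((mean_x, mean_y))
    return new_centroids


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: pick k data points at random
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)  # step 2: cluster assignment
        if new_labels == labels:  # assignment stopped changing: converged
            break
        labels = new_labels
        centroids = move(points, labels, centroids)  # step 3: move centroids
    return centroids, labels


# Hypothetical home locations: two visibly separate neighbourhoods.
homes = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
schools, labels = kmeans(homes, k=2)
print(schools)  # one suggested school location per cluster
print(labels)   # cluster index for each home
```

The returned centroids play the role of the school locations, and the labels say which home goes to which school.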
for any data scientist is not to remember
which library is required or what are the codes in my understanding
the most important thing which data scientist
should remember is that once you’ve been given
a business problem first, you should be able to understand that what kind
of problem it is, whether it’s a problem
of supervised learning or it’s an unsupervised learning given its a supervised
learning problem, whether it falls
into the regression type or a logistic regression type. If you can make
these decisions then for implementing the algorithm
you will find a lot of help in fact scikit-learn
would have initial codes for almost every algorithm. So you don’t have to remember
line of code and algorithms. All you should be able to do is once a problem
is been given to you. You should be able to identify
what kind of problem it is most of the time
in unsupervised learning and specifically in
k-means kind of models, we use this elbow method
as an indicator to help you understand
what is probably a good number of clusters to start with for
the final implementation of your model. So let me give you an intuition of how it works. SSE stands
for sum of squared errors, and what it means is, if I go back a little: suppose you have identified
these two clusters. So the sum of squared errors would be that you take the centroid and measure the distance
for the points which are associated
with this cluster. So you measure the distance for
each point in the orange group, square the distances, and sum
them all, and the same exercise is done
for the blue points, and whatever total comes
after doing this exercise gives you the total sum
of squared errors. And if you have two clusters, you would have some number
and just for intuition I’m saying this is total sum
of the square is coming as hundred and that’s only
for intuition and example. I’m taking this number
to help you. Let’s say there was
one more cluster somebody identified here, and all
these three points belong to it. They are blue in color, but I’m saying all
these three points belong to this particular segment, and the rest of these points
remain the same as previously. And
as we saw, with two clusters our sum of squares
was coming as a hundred. When we have three, you can see that
these points are bit far off from this particular cluster. So if I will be doing it
with three clusters, this distance would be
a bit less given that now I have a point which is closer to these points
and whatever error or distance these three points
were adding it would be bit less given the cluster was
here and let’s say this distance goes down to 95. I’m just making up some numbers. So probably what
it is telling me that sum of squared is going down and probably I
am finding clusters which are closer or more closer
to the actual data points. And as you would know that if I will be increasing
the number of clusters in the population, this distance would be
going down, hopefully, and this distance can go down to 0. And at what point
can this distance go down to 0? When every point becomes
a cluster itself. So if, let’s say, I have
20 data points there and I assign that every point is
a cluster in itself, then I am just measuring
the distance of each point from itself, which would be 0,
and the overall SSE will become 0. So it may start
from a very high number, but it will be reducing with
each cluster you add to your data. So this line, which is sometimes
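To make the sum of squared errors concrete, here is a small sketch. The four points and the centroids are invented for illustration, and the centroids are placed by hand at the cluster means rather than fitted by k-means, so treat it as a toy calculation only.

```python
def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for x, y in points:
        total += min((x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids)
    return total


# Hypothetical data: two tight pairs of points, far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_cluster = [(5.0, 1.0)]                # mean of all four points
two_clusters = [(0.0, 1.0), (10.0, 1.0)]  # mean of each pair
every_point = list(points)                # each point is its own cluster

print(sse(points, one_cluster))   # 104.0
print(sse(points, two_clusters))  # 4.0
print(sse(points, every_point))   # 0.0
```

The big drop from one to two clusters, followed by only a small further gain, is the "elbow" shape the method looks for, and the last line shows why the error reaches 0 when every point is its own cluster.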
being called the elbow method what it’s actually showing you that when you had one cluster
this was the distance. So if you had one cluster only
anywhere in the population and you do sum
of the squared distances, this was the distance
when you had two clusters, this was the distance when you hit three
these were the distance when you had for this was the distance, but when you had five the sum
of the squared error did not reduce much. So if you see it’s like
very less and after that even though you keep on
adding different clusters, the sum of squared errors
is not going down. So as I was saying that this process
or this method It is kind of indicative method
and it gives you an intuition that if I have done it my cluster analysis with
different number of clusters and I’m measuring the sum
of squared errors for given number
of clusters, and I see that after four the sum of squared errors
is not going down. It gives me an indication that probably I
have found clusters which are more or less coherent and the population
is not very much away from the centroid
from the point. You can make it an assumption that probably four clusters
is a good idea for my given population. But as I also mentioned that it is just
indicative process, it’s a good starting point, but you need to see that how the distribution
of your cluster look like whether they solve
the business problem. You’re trying to solve or not. And if not that whether you need
to further divide the Clusters which your initial
model has identified. And here will be taking
a very quick introduction to a third type of learning, which is called reinforcement
learning. What it actually is: we have seen
two learning types. There is supervised learning
and unsupervised learning. The first one was that we are trying to predict
some dependent variable in the second one. We are trying to identify
some kind of structure in the data set or
if I put it into other words that we are trying to identify
some kind of coherent groups in the population third one
is reinforcement learning and it’s basically that any object or a system learns
from the environment and there is no right
or wrong answer given to the system explicitly
or in the beginning itself, like in the case
of supervised learning. Here the object would be moving in the environment.
An example is self-flying helicopters, which fly on their own:
they take decisions based on the wind speed and
the pressure around them, and they correct
their procedure accordingly, and the objective
they need to achieve is that they need to fly
for a longer period of time. So here we are given an example. Let’s say we have a robotic dog and somebody needs to train
it to take correct decisions, and correct things would be that it is walking on the path
where people need to walk, it’s not going
off the path, and if some task has been given,
it is doing it correctly. So there are two components
of reinforcement learning which is called
reward and penalty if the object or the system does
the correct thing it receives some reward in terms
of mathematical things. Obviously, we will be providing
everything in terms of numbers. And if it does the wrong thing,
it receives a penalty, and on this basis it
will keep on taking decisions. So, like a dog,
if it’s working correctly, it receives points: say a ball is thrown;
if the robotic dog goes and picks it up, it’s a reward point. If it doesn’t do
the correct thing, it receives a penalty. So almost all these
reinforcement learning agents work in a similar fashion. For those of you who are interested
in implementing it, there is an algorithm called the
deep Q-learning algorithm, where you can design
your own system and you can assign what are the rewards
and penalties. In a similar fashion, reinforcement learning is also about
interacting with the environment, as I mentioned. So the self-driving car is
also one of the examples which would be receiving rewards
and penalties based on whether it’s running on the track taking
the right turns and moving at the correct speed maintaining
distance from other cars which are running. So reinforcement learning
has a huge application for
self-driving cars, or for some of their components, not all. Some of the components in a self-driving car
are also supervised learning, for example where the car
needs to understand what the objects in front of it are, and all other
object identification. So what are the real limitations
of machine learning? Given that we already have all three types of algorithms,
supervised, unsupervised and reinforcement
learning algorithms, available, then why
do we want a new architecture or new type of algorithms for
artificial intelligence systems? First and foremost
is the dimensions. And when I say dimension, I mean the type of data
we get from a lot of sources. Let’s say we receive images,
which are grid-like data: where the pixels are and what the intensity
of each pixel in the image is. Then natural language processing:
language data comes in different lengths, and, you know, the task is also
different, in the sense that suppose we need to design
a machine learning algorithm which can do
language translation. And if you conceptualize this
idea of language translation from a machine learning
algorithm perspective, your input becomes a sequence of words and your output
is also a sequence of words. Some of you who are working
with machine learning algorithms, try thinking whether we have any algorithm
currently available, like logistic regression
or decision trees, which can help me even
fit the problem, leaving aside how good the accuracy would be
and all. But these are problems which come from a different type
of data source, and we trying to solve
a different kind of problem like language translation
or chatbot kind of problem where you give a sequence
and it returns a sequence. So this kind of architecture is not available
in machine learning. So that is one of the reasons that we need to
identify some algorithms which can deal with such
data sets, like images and languages, and which can also fit
different kinds of models, which are not only
for predicting or classifying but also give you some kind of values like the sequence I took
as an example. So that is the first reason we
are looking for a different type of architecture for solving such
problems. Then the second problem, which machine learning
algorithms are not very good at dealing with, is
the dimensionality. So we would have seen data with a size of, let’s say,
a thousand variables, so that you have 100,000 rows
and a thousand columns. Probably you can still fit some of the machine learning
algorithms on top of it. But consider the kind of problems we are dealing with, like
images. Every image, let’s say, is 200 by 200, meaning
200 pixels by 200 pixels, and it’s a colored image, which means there are
three channels. We will be discussing this in detail, but basically all
I’m trying to say is that a simple image of 200
by 200 pixels will be giving you, if you do the maths,
200 multiplied by 200 values per channel. So let’s say this is your image
and it’s 200 by 200, because every image is
a kind of matrix. And if it’s a colored image,
colored images are represented in the system through three channels,
red, green and blue, so there would be
three such grids, one on top of the other. So the
number of values you need to represent your image in the system or in your
algorithm would be 200 by 200 by 3. And then you calculate
how many features it would be: if my math is correct, it should be
120,000 features. So even a simple image
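The arithmetic for the 200 by 200 colored image can be checked in a couple of lines; the variable names here are just for illustration.

```python
# A 200 x 200 image with 3 color channels (red, green, blue).
height, width, channels = 200, 200, 3

# Every pixel value in every channel becomes one input feature.
features_per_image = height * width * channels
print(features_per_image)  # 120000
```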
of such small Dimensions you end up getting
120,000 features and plus if you are really working
on a complex problem solving in terms of let’s say an object
identification in the images. They may be five or six objects, which you need to identify
and you’re dealing with let’s say 100,000
images then your scale of data become so huge for any machine learning
algorithm to easily handle it and your machine learning
algorithms fail in terms of getting any interesting
results out of it. So coming to solution part
that we need an architecture which can not only read
such data in terms of images, but it is capable enough of dealing with such
huge dimensions of data. So this is the second benefit which comes from
the deep learning algorithms and we will be discussing
how they handle such high-dimensional data when we go and talk
about different architectures. And the third and the
most important reason that we will be looking
for a different kind of model structure
or different kind of algorithm is for identifying the features. So let me spend a couple
of minutes on this idea. What do we mean
by identifying features? So in machine learning
algorithms, we as data scientists spend a lot of time curating the important
features. First, you’ll be scaling the features, and after scaling
you’ll be creating the interaction variables. Then, if the separator
is not very clear, you need to introduce
higher-dimensional data. Let’s say these are your data points,
and if you see that the line you’re fitting is
not separating them clearly, then some of you would be trying
the higher-order polynomials of your input features. So all such things are not only
difficult to, you know, come up with; there is also a lot
of trial and error about which kind
of transformation and which kind
of variable creation, that is, what kind of variable, will really work
for a classification problem. That’s the first thing, and the second is, if you’re working
on higher order polynomials, what is the correct
order of polynomial I need to create? And just
to give you the scale of it: let’s say you are dealing
with only a hundred features and you need to create
a second-order polynomial with interactions of
all these hundred features. Then you will end up getting
around 5,000 features from the second-order polynomial alone. If you want to get
a third-order polynomial, like cube variables or the interaction
of three variables together, then these hundred
variables will come to around 170,000 features. So this creation of features is
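The rough figures of 5,000 and 170,000 can be reproduced by counting terms. The sketch below assumes we count every distinct product of exactly two (or exactly three) of the hundred variables, with repeats allowed so that squares and cubes are included; `n_terms` is a helper name invented for this example.

```python
from math import comb


def n_terms(n_features, degree):
    """Distinct monomials of exactly `degree` in `n_features` variables.

    Counts pure powers (x1*x1) and interaction terms (x1*x2) together,
    i.e. combinations with repetition.
    """
    return comb(n_features + degree - 1, degree)


print(n_terms(100, 2))  # 5050   -> "around 5,000" second-order terms
print(n_terms(100, 3))  # 171700 -> "around 170,000" third-order terms
```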
very, very difficult. And in the case of images,
if you look at the image which is in front of you, suppose we go and start creating
these features on our own, and our objective is, let’s say, to identify a
television in the image, which I’m highlighting here, and we have some pictures where we
need to identify it. Even though you have created
those features manually, and some of you who have been working in the field
of computer science for quite some time
would know that earlier we used to use features
like SIFT, SURF, Haar features and HOG features, these are kind
of static features for a given object.
We may argue that this television is there
in this picture here, but in another picture
it can be somewhere else. So the feature which I’m identifying has
to be spatially invariant, so that it can be
anywhere in the image. And the same goes for language: if you’re dealing
with language data, the model should be able to understand not only
the meaning of a word or how the word
fits into the sentence, but should also
be able to understand what the context
of each word is. Word embeddings learned by neural networks help
you understand what the related words to a given word are, and from that
you make predictions. So these are the broad problems
of machine learning algorithms: one is that they
are not able to deal with different types
of data like images and natural language; second
is the dimensionality problem, if the dimensionality goes to
something like 100,000 features; and third is this feature
creation on their own. So these are the three basic
reasons that some or all of you would be interested in going to one of the deep learning
architectures for solving such problems. And fourth, if I may add, all the deep
learning architectures given that we are putting a lot
of computational powers in them. They end up giving you a better
accuracy both for classification and regression problems. So that’s
the fourth benefit. And how does it really
work? There are different stages in deep learning,
and they are called deep because it’s not just
input and output, like we have seen in regression, where you have y and x and some kind
of linear equation. Here we have different intermediate
stages, like the ones you are seeing on the screen; there are a lot
of intermediate stages for doing such complex calculations, so that all these features
which I mentioned that suppose you need
to identify television all such features get calculated at different stages one
after the other and final stage. You have very very
refined features. Not only for image. We are taking
the image classification. But any problem we
are trying to solve through the multiple stages your model would be able to
learn these intelligent features which are really important
for your classification or regression or
any such problem, which we are trying to solve. So these were the few benefits
for deep learning and these are actually
the broad reasons that somebody would be interested in learning
the deep learning algorithms. And if you can see the screen now, the application
of deep learning has reached almost all areas,
whether images, language, or even the structured data which we have been working on using
machine learning algorithms. There is a huge implementation
of neural networks. Now in dealing with structured data as well
for predicting like churn or who will be buying
the product or not. Even these kind of algorithms
are moving towards deep learning so though you would have seen
a lot of implementation from images or language only but the application of deep learning is now
happening in almost every field and the different architectures which are been implemented
for different such problems. So if you see some of the applications
are already here, like automatic machine
translation object classification in photographs, then there is a library
or there is an API being introduced by tensorflow
just a couple of months back which is called
object detection. So the object detection API
is very strong. I have used it already, and it can help you classify
almost 90 objects in an image, and it’s so powerful that its accuracy is almost
99 percent; sometimes it even rivals human
visual powers. We have handwriting generation. So there are new kinds
of architectures, which are not in the scope here, but there is a new kind
of algorithm called generative adversarial networks, or GANs, and these networks
are really powerful in generating new data set
and what I mean is that you give some images, let’s say you have
some thousand images, and you want to generate fake images,
or images from this data, which look similar
to the images you already have. It means
you can generate your own data from the existing data set. So we have a separate algorithm
and image captioning is there that you can embed
different models together for generating text
from the image that what’s happening in the image. You could have seen
the image captioning and game playing we have already
seen automatic car drive. And also there’s
a lot of applications. This is just a small
list of applications one small example is Google Lens
you can try installing and see how does it work. It can read a text
from the image itself. It can identify
different objects. So basically they’re
huge implementation of deep learning modules and algorithms in different
spheres of business problems and different kind of data and different kind
of business problems; some of them are listed here.
So what are these tensors? Tensors are nothing
but arrays of numbers. So these are just
multi-dimensional arrays where the numbers
are represented in terms of a matrix. As we would say, this is a
3 by 3 matrix, this is a 6 by 4 matrix, and
this is just a vector. The rank represents
how many dimensions there are. So if you look at it, the vector
is just a one-dimensional tensor. Here there are two dimensions,
one is this and the second is this, so it’s two-dimensional, and
this is a three-dimensional array, so it has three dimensions
here. People who are familiar
with multi-dimensional arrays should be fine, and if not, probably this
example can help you visualize what we are talking about. So the number seven is just a scalar; anyone who has done introductory matrix
computations would have heard that a single number
is called a scalar, and here its rank is 0;
it is just a number. Any vector of any length would
be called rank 1; this is a one-dimensional array. This is a two-dimensional array, and most of the data
would be coming in 2 dimensional array shape like just to give
you an intuition that you can think of these as different
columns age gender income City and all those different numbers so you can think
of these as columns and these as rows. So whatever data
we have been dealing with in structured form, you can think of it as
this structured data. So this is
two-dimensional data, or we can also call it
data of rank 2. Image data also comes in this kind
of format; an image is either two-
dimensional or three-dimensional. But basically, if you
look at the image, there are some pixel values on the image and based
on those pixel values. The image comes up on the screen so image is also
a two dimensional array if it’s a black and white image. However, if it’s a colored image and I also mentioned I
think just some time back that there are three channels or three colors which make
different kind of colors based on the combination
of three channels. So basically a color
image look like this that they would be
three arrays of numbers for three different colors. So RGB red green and blue and they would be
pixel intensity on each one of those something like which
you are presenting here and based on the pixel intensity
of all these three on top of it different colors
would be coming. So in a combination
of these three pixels for three channels would be
generating different kind of colors on the screen. If you have this kind of data, it’s called rank 3 because there are three
dimensions your data can be of more than three dimensions. You would know that it will be
very difficult to visualize. So this is a basic understanding or introduction to
what these tensors are and what we mean
by different ranks and dimensions of a tensor. So let’s get started. So one takeaway
from the points we have discussed
in the last couple of slides is that data has to be converted to, or represented
in terms of, tensors before we do
any kind of calculations or mathematical computations
in TensorFlow. Now look at shape. Again, we have seen that if it’s a number only,
its dimension would be 0. For the shape of a vector, there would be
different functions, and we have already seen this
in the previous example. If you recall, when we
were running the regression and logistic regression, I ran this shape function, so X
dot shape and y dot shape. When I was running it for X, it was giving
me the number of rows and columns; when I ran it for y, it just gave
me the number of rows. So in a similar fashion, this
is just a one-dimensional array and there are
five objects in it, so the shape is
5. For a two-dimensional array, the matrix has a shape
of number of rows and columns, so this is the shape of this one. And for
a three-dimensional array, it will be rows, columns and the depth
of the multi-dimensional array, so it will be
five, four and three. So that’s how you
need to read it:
five in this direction, four in this direction
and three in the depth. So every data point
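Rank and shape can be checked quickly with NumPy arrays, which follow the same convention as the `.shape` calls mentioned above. This is a small sketch with made-up arrays; the names `scalar`, `vector`, `matrix` and `cube` are only for illustration.

```python
import numpy as np

scalar = np.array(7)          # a single number: rank 0
vector = np.zeros(5)          # five objects in one dimension: rank 1
matrix = np.zeros((5, 4))     # rows and columns: rank 2
cube = np.zeros((5, 4, 3))    # rows, columns and depth: rank 3

for name, tensor in [("scalar", scalar), ("vector", vector),
                     ("matrix", matrix), ("cube", cube)]:
    # ndim is the rank (number of dimensions); shape is the size along each one
    print(name, "rank:", tensor.ndim, "shape:", tensor.shape)
```

A color image stored as height by width by channels would look like the `cube` case, with 3 as the depth.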
has a shape. And why this is really important, and why I’m spending
a good amount of time on this, is that even today most of the time we end up making
a lot of mistakes in terms of defining the dimensions
of the data and you will see that when you really
design and architecture for deep learning you will
be mentioning lot of points. Like what is the dimension
of your inputs? What is the dimension of output? What is the dimension
of different weights and they would be
a huge number of weights. Actually, the only
big problem in TensorFlow or any other deep learning library is that you still need to define
the weight dimensions on your own, and most of the time the problem
comes from this section itself. When you go and design your own
deep learning architecture, you need to be really cautious
about the dimensions. So there are two sections
in the TensorFlow library. TensorFlow is
an open source library, at least for now,
from Google. It started from the Google Brain project, and now it’s available
to all of us free of cost. There are two sections
of TensorFlow. One is the tensor, which we have already discussed: we need to convert
our data into a numerical representation. And then the
second is flow, and people who come from a programming
background would already know what lazy evaluation is,
but for those who are new, what do we mean at an intuitive level?
We give our inputs, so I’ll give you
some names here. X is our input
features, W is the weights associated with it, and MatMul
is matrix multiplication. So you take X and W, do the
matrix multiplication, and add the bias term: we are taking
the bias and adding it. And then we are applying one of the activation functions:
whatever output comes from this, you apply the activation
function called ReLU. So in TensorFlow,
how it runs is that you will not be getting
output for each of these steps till the time we explicitly call it. So how it works in TensorFlow is that you design the entire flow
of your algorithm or your program, and you only
run this last component, which is the ReLU. And what will happen is that when you run this command, just the ReLU operation,
it will go back and automatically calculate all
of them in the back end. You can see the output
whenever you want, but that’s how it works: you don’t have
to run each one individually. You just run the last part, and it goes back and runs
the intermediate or earlier sections of your functions. As I said, there are two sections
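The flow just described (matrix multiplication, add the bias, apply ReLU, and only then run the last node) can be sketched as below. It assumes a TensorFlow 2.x installation and uses the `tensorflow.compat.v1` module to get the session-style behaviour shown in this session; on an actual TensorFlow 1.x install a plain `import tensorflow as tf` would be used instead. The numbers in `X`, `W` and `b` are invented for the example.

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # build a graph first, run it later

# Define the graph: nothing is calculated at this point.
X = tf.constant([[1.0, 1.0, 2.0],
                 [0.0, 1.0, 0.0]], name="X")      # input features (2 rows, 3 columns)
W = tf.constant([[1.0], [-2.0], [0.5]], name="W")  # weights (3 rows, 1 column)
b = tf.constant([0.5], name="b")                   # bias term

z = tf.add(tf.matmul(X, W), b)  # matrix multiplication, then add the bias
out = tf.nn.relu(z)             # activation function

print(out)  # only a description of the tensor, no values yet

# Run only the last node; TensorFlow works backwards through matmul and add.
with tf.Session() as sess:
    result = sess.run(out)
print(result)
```

Only `out` is passed to `sess.run`, yet the multiplication and the addition are evaluated too, because they are upstream of it in the graph.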
within TensorFlow, which we have already seen,
the tensor and the flow, and within flow the work is done in two parts. First, you have
to define the graph, so you will be defining what functions and what calculations
you require in the program, and then you have
to explicitly run it. Once you have installed
TensorFlow on your system, calling the TensorFlow
library is as simple as what we have done
for libraries like pandas or NumPy. It remains absolutely
the same: you import tensorflow as tf, and tf is just a name; you
can give any name instead of tf. I’m running this code on Python 2.7; if somebody is still
working on 2.7, it should be fine. There are three types of data objects, or
data types, in TensorFlow, and all the programs, or
everything you will be writing in TensorFlow from now on,
would use one of these data types. So you will be explicitly
telling the program that I’m writing this
whatever line of code and whatever assignment
function you will be using a for one type of these data. There are three basic types, and these would
be used extensively across your programs. So first one is a constant
and a constant in TensorFlow is absolutely the same as whatever our understanding
has been of a constant: a constant is a value which doesn’t change.
Whatever value is assigned to a constant, it remains the same. For example, if I say a equals 5
and I treat that as a constant, then the value of a will always remain five;
it doesn’t change at all. So the same convention remain
in tensorflow that if I have specified
a data object as constant its value
would be same across the program and will not be changing and I can assign constant
as a string or a number or an integer. It can be any of the times. So let’s see a hello world
program in tensorflow that how you can run
it that hello world. I’m assigning hello
SDF not constant. This is vapor because I have installed or I
have imported tensorflow stf. Now once whatever name you assign it as then
you would be calling all the functions with
the same abbreviation itself. So here I am saying that TF naught constant and I’m giving the word
as hello world and though it has been in all
our programs to now in Python that once I have assigned
a value of some object if I want to run it, it should give me the value but here you see it’s
not giving you any output. My expectation was it
should be giving me hello world, but it’s giving me TF not tensor
constant shape D type is string but not giving the actual value which I wanted
and why it is not giving because that there
are two components of a tensor flow program one was that you do design
or set up the graph and second is running it. So intensive know everything has
to be run within a session. So this session is
the Running part. So we have to explicitly
tell the program that run this command or run this object
within tensorflow session. And this is one of the ways of writing
a tensorflow session command. Did he have not session
an S has to be a bit later as says and then you
have to use actually this command says don’t run
and then you say what you want to run it for. So if you see
I’m calling an output by running says don’t run and then I’m calling
this hello constant and then I’m asking
that print the output and if you do it now, it would be giving me the output
which I expected earlier. So this convention
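A minimal sketch of the hello world program described here, again assuming TensorFlow 2.x with the `tensorflow.compat.v1` module standing in for the 1.x API used in the session:

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # keep the define-then-run behaviour

# Set up the graph: a single string constant.
hello = tf.constant("Hello World")
print(hello)  # shows the tensor's name, shape and dtype, not the text

# Run it inside a session to get the actual value.
with tf.Session() as sess:
    output = sess.run(hello)
print(output)
```

With the `with` form, the session is closed automatically when the block ends. In Python 3 the string comes back as a bytes object.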
would remain the same across all the programs from now on for TensorFlow:
whenever you have done the assignment part, you have to run this function or any line of code
within a TensorFlow session only, and one of the ways
of doing it is this. So there are different types
of data: float32 is a real-valued type, there are integer types as well,
and these 32 and 64 are the bit size of your data
being defined. People who come from
programming would know how many bits
are taken to define your number in the back end
from that perspective. There are two or three types which we will be working with
extensively. One is the float, and most of the time we
will be using float32. The other one is int, which we will be using
for specifying integers. But for constants, float
32 is the default type of data till the time we go ahead
and explicitly mention that we want to specify the constant as int. Then you have to tell
that the constant which I am specifying. Let’s say Node 1 is
d f dot constant and I’m specifying what
this value is and I’m specifying whether it’s float type. And if I have just given
the decimal here it is easy to understand for the algorithm that it’s a float
32 kind of data and as in the previous command, we saw that when we
run this code, we don’t get
the output of Node 1, which is we were expecting
three and four node to we were expecting for but we get this information that this is just a tensor and what are the type
of data each constant has and like we run the session
in the previous. Example I was saying there was
one of the ways where we said that with tensorflow
session as some name and then run it we
can Define it in another way. Like we can say sesh
equal to TF dot session and then I can run
whatever command I want to do. So here I can do it
that print says don’t run Node 1 and node 2 and says closed
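The define-then-run split and the two ways of opening a session can be imitated in plain Python. This is only a sketch under assumptions: `Constant` and `ToySession` are made-up stand-ins written for this note, not TensorFlow classes, and they merely mimic what tf.constant, tf.Session, sess.run and sess.close are described as doing above.

```python
class Constant:
    """A graph node that only remembers its value until a session runs it."""

    def __init__(self, value, dtype="float32"):
        self.value = value
        self.dtype = dtype

    def __repr__(self):
        # Printing a node shows its description, not its value.
        return f"<Constant dtype={self.dtype}>"


class ToySession:
    """Hypothetical stand-in for a session: the only place nodes are evaluated."""

    def __init__(self):
        self.closed = False

    def run(self, nodes):
        if self.closed:
            raise RuntimeError("session is closed")
        if isinstance(nodes, (list, tuple)):
            return [node.value for node in nodes]
        return nodes.value

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Leaving the with-block closes the session automatically.
        self.close()
        return False


hello = Constant("Hello World", dtype="string")
node1 = Constant(3.0)
node2 = Constant(4.0)

# Style 1: the with-block closes the session for us.
with ToySession() as sess:
    greeting = sess.run(hello)

# Style 2: the session is opened by hand, so it must be closed by hand.
sess2 = ToySession()
values = sess2.run([node1, node2])
sess2.close()
```

Printing `hello` directly shows only its description, which mirrors what was seen before the session was introduced.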
This one is very important: if we are
specifying our session in this particular fashion, where we say sess equal
to tf.Session(), then we have to explicitly close the session. If you don’t close the session,
it keeps running, and there may be some issues.
Suppose you have defined node1
multiple times in your code;
there is a possibility that it may pick
up the wrong value. So if you are
specifying your session in this particular way, you
need to say sess.close(). If you are running it
the previous way, like we did here
with tf.Session() as sess,
then whenever the block is done
it closes automatically, so you don’t have
to worry about
running sess.close(). So there are
two ways you can use TensorFlow sessions. Here I wanted to show you that there are very
simple calculations which you can do, and I’m showcasing
the flow. When I presented
the first slide on TensorFlow, I said there are
two components, tensor and flow. Tensors you can define
like this, as a constant; this is one of the ways
of defining tensors. Then I am doing some kind
of mathematical computation. Let’s say I want to assign c
as the multiplication of a and b. This is
the formula or function which I need to compute. As all of you know,
we will be doing a lot of computations
in our deep learning programs, but here I am showcasing
a very simple one: I need to multiply a
with b. It remains the same as earlier: defining it doesn’t give
me any output. When I run c, I don’t have to run a and b as well,
like in the previous example. If I go back a bit,
you see I’m running node1
and node2, which are nothing but constant assignments. What I’m trying to show here
is the flow perspective which I mentioned
in the slide. When you have designed
a program like this, where you specify a and b
and then the mathematical
relationship between a and b and the resulting value,
all you do is run
the final step of your overall program,
and here c is the final step. So I only run c, and
what it does in the back end is this: it runs c, sees that c
is the output of a and b, then goes back
one step further, looks up a and b, and runs
the program accordingly. So if I call only c and run c, it calls
a and b accordingly and gives me the final output, which is 4 multiplied by 6, those being
the constant values of a and b. This is a very small toy
example of how the flow works. Had there been
more functions after c, say d is c
multiplied by some number, you would just need to run d and all the other functions
would run automatically. So the second type
is called a placeholder, and we will be using
this kind of data type extensively. I really like
this one-line definition, which I have copied
from the TensorFlow website itself:
a placeholder is a promise to provide a value later. The simplest
example is that most of your feature
and label values, your X and y, would be set up in your TensorFlow program
as placeholders, and you can feed any value to them later. Let’s say you have assigned a
equal to tf.placeholder. We are saying it’s
a placeholder and the type of data is float32, so whatever values I
provide for a in the future will be floats. For b, too, I’m mentioning that it will be
float32, and then I can assign a
mathematical operation on them. One more thing I would like to add:
placeholder objects always come with another thing called
feed_dict, or feed dictionary. Whenever we have
a placeholder object, we give it its actual values
with the help of a dictionary. People who know the dictionary type
in Python know that we assign keys and values
in this particular format, which is
written in Python with curly brackets. What we are
actually saying is that there is a dictionary
where a is 1, 7 and 6, and if you recall, this kind
of representation is a list. So a is a list of 1, 7 and 6,
and b is a list of 3, 20 and 2. You can think of this as assigning
your X equal to 1, 7 and 6 and your y equal to 3, 20 and 2, and this is the same
way you will be designing the input and
output features of your model. When you run it, you get this output:
1 plus 3 is 4, 7 plus 20 is 27, and 6 plus 2
is 8. The benefit of this kind of implementation is that the data on which you are developing
your model may change its shape and size at any time. To take a very simple
case, initially when you ran the model
you only had three data points, but tomorrow you get
two more data points.
So let’s say you have added
two more values to your data, and this
is the new data set when you run it. You don’t have
to change your code on top. You just change the assignment
in your feed dictionary, and the operation
works on the new data accordingly. That is
very helpful when you are
running bigger programs, where your X values and y values may keep
changing, the number of columns and number of
rows in your data set may change, or there is some kind
of problem with your data which you need to fix
every time. Another thing I wanted to show you: as
in the previous example,
apart from this adder, a plus b, I can assign
another calculation, like multiplying by 5.
It takes the output coming from the previous operation
and multiplies it by 5. I don’t have to run
the intermediate functions;
I can just run the last function. And here I’m
giving two-dimensional data as input. If you see, it’s
like three rows and three columns of data, and it still does
its job accordingly. The third and another most
important type in TensorFlow is
called the TensorFlow Variable. Notice that it is written with a capital V;
sometimes you may make the mistake of writing it
with a small v, given that constant and placeholder
start with lowercase letters. Let us first
understand what a variable is. The definition, again
a one-liner from the website itself, is
that a Variable allows us to add trainable parameters to a graph. What it is basically
saying is that tf.Variable
is the kind of value to which you can initially
assign anything, and during the program you
can change the value. I will be telling
you why we need it, and that is the important part
for the learning process. From the syntactical
point of view, just like we did for a placeholder
or a constant in TensorFlow,
we can assign a value. Let’s say we are assigning W1, and you
can think of W1 as weight one. We assign tf.Variable, and we say the
initial value is 0.5 and the type is tf.float32. This is the way of assignment, but there is one more
syntactical line which is required to run your code.
Every time you are assigning Variable
type data in your code, you have to include this line,
tf.global_variables_initializer().
If you don’t, your program will
not assign the value
0.5 to W1. In fact,
let me comment it out and
run without it. It gives you
a huge error saying init is not defined, and even if you change it to something else it still gives
you a huge error, because to assign the values you need to run
this particular line of code. When you run it, it initializes
the tf.Variable data and does the assignment
of 0.5 to W1. So this is only
syntax: however many tf.Variable
assignments you have done in your program, at the end
of the program you just run this line so that the assignment is complete
for all the variables, and then you can use them the way
you wanted. So this is a very
simple intuition. So let’s take
an example from regression. My intent here is to explain
the concepts behind the cost function and optimization.
Let’s say this is our X and this is y, a very
simple setup, and this is our actual data. Hopefully you can see my screen; let me change the color. So this is the data we have, and the question is
what is the best regression line that goes through the data set
and gives a good prediction. You know that there can be
any number of lines which pass through it:
I can use this line, or this line, or
any of an infinite number of lines. So how does your algorithm
decide
what the best line is
for your equation? Whatever output came yesterday
from scikit-learn, let’s say it was this. I’m just making up some numbers: the intercept value is
0.2 and the coefficient of x is 0.3, and this
is the equation of the line. Let me highlight
it once more. Let’s say
the green one is the line which your model has selected.
You would agree with me that your model has
tried different lines and after that it has selected this one. So how does
this process happen? This is core to all our algorithms, whether
machine learning or deep learning. This process
of one thing called cost and a second thing
called optimization is common
across all your algorithms. If you understand this part,
then all you need to see
is how these two things vary
for different algorithms. The process itself does not change
for any of the algorithms. So how does it really work? Let me first talk
about only cost. Let’s say the green one
I am hovering over is the line,
and I need to understand what the cost for it is. For each point, this is the predicted value and this
is the actual value. The total cost for this particular line is the sum of squares
of all these distances. So whatever the distance was here, and here, I take all
these distances and I take the sigma of
their squares. The distances are nothing but y actual, let me call it y i, minus
y i hat for the given line. So we have understood how we calculate the cost.
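As a rough sketch of the cost just described, the sum of squared distances can be written in a few lines of plain Python. The data points and the second candidate line are invented for illustration; only the intercept 0.2 and slope 0.3 come from the example above.

```python
def predict(a, b, x):
    """Predicted y for the line with intercept a and slope b."""
    return a + b * x


def sse_cost(a, b, xs, ys):
    """Sum over all points of (actual y - predicted y) squared."""
    return sum((y - predict(a, b, x)) ** 2 for x, y in zip(xs, ys))


# Made-up data points scattered around the green line.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.6, 0.7, 1.2, 1.3]

cost_green = sse_cost(0.2, 0.3, xs, ys)   # the line from the example
cost_other = sse_cost(0.5, 0.1, xs, ys)   # a second, hypothetical candidate

print("green line cost:", cost_green)
print("other line cost:", cost_other)
```

Whichever candidate gives the smaller number is the better line under this cost.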
And our objective is this: for whichever lines we are
fitting, for all the lines,
we compute the cost,
and whichever line gives me
the least value of this cost
I call the final line,
and I report the slope and intercept of
that particular line. So now we have understood how the cost
is calculated, but there is one thing
still not clear:
how does the system identify
which line it is, and how does it
start with a line? As I was saying, initially this is the data, and all we know is that this is our data
and we need to find a regression line for it. Just to mention, this process
remains the same for both classification and regression, whether
multi-class or single class. The way we calculate
cost and do the optimization will be different, but the process remains
exactly the same. Let me write what we were looking at: cost
and optimization. Cost we have understood
from this toy example, how we calculate it. But we were in a discussion
on tf.Variable, what
tf.Variable is, and why it is important. So let me come back to why tf.Variable
is important. We know that we have just one feature,
and our equation is going to be y equal to the intercept
plus the slope multiplied by x. As of now, we don’t know what the value of the intercept
or the slope will be. So here is how this process has
been designed, and I am emphasizing
that it is the same
across any one of the algorithms.
Let’s say I’m representing the intercept by a, for brevity,
and the slope by b. Initially we assign any value to
a as a tf.Variable, because I know
its value can change, and the same for b as a tf.Variable. Let’s say
initially I assign them to zero. There are different,
more intelligent ways of choosing the initial value, but let’s say I
have initially set both
tf.Variable values
to 0. You will agree with me that if I have a equal to 0
and b equal to 0, then the predicted line from this equation
is my x-axis, because both the values are 0, so y is always 0
on this particular line. What I do
once I have initialized these values is calculate
the distance for this line. So the prediction
from my model is the bottom line, the x-axis. This is the prediction coming
from a equal to 0 and b equal to 0. Then I calculate
the distance from my line, and I take the distance for all
the points. I take the distances
and calculate my cost: the actual values
are these points, the predicted values are
all these 0 values, I take the distance
for each point, and I compute the sum of squared errors. Whatever the cost I
have got, this is fixed for now. In the next iteration,
because this is an iterative process, I’ll be changing
my values of a and b, and I’ll keep on doing that
until I reach a line where the distance
between the predicted values and the actual values is minimal. I’ll be talking about
how these values are changed. So I have computed
the cost, and let’s say that in this particular instance
the sum of squared errors is
a hundred: the distance of each point
from the line, squared and summed, comes to a hundred. In the next step,
I change the values of a and b. Because
these are Variable types, I can change the values. In
the next iteration, instead of having
a equal to 0 and b equal to 0, I assign
a equal to its previous value plus 1, and b is again
updated by plus 1. Now a is 1 and b is 1,
so the line changes.
I have changed the values of a and b and they are now
a bit positive, so the line would probably look
something like this, because the intercept is now
at a distance of 1, and each time x changes by one
you add one more. So instead of having
the earlier line I end up getting a new line. This is the process
of changing my weights in a direction where the cost reduces. Let’s say for
the line where a is 0
and b is 0 the cost was 100, but when I updated a to 1 and b to 1 the cost
reduced to, let’s say, 90. What I mean is that
when we updated the values of a and b from 0 to 1,
the cost went down. This process is
called optimization. These two terms, cost
and optimization, work hand in hand. The process is that you take any initial
values of a and b, here I am initializing with zeros, then you compute the cost
from the differences, and then you update
the values in the direction in which the cost reduces.
I’ll be going into detail on how the model understands
in which direction the weights
need to be changed, whether it needs to add
1 or subtract 1, or add 2 or add 5.
But from the intuition
perspective, this process of updating your weights
in the direction in which your cost is minimized at each iteration
is called optimization. We’ll be going in different
directions to understand
how the model does this once we have understood
the concept of loss, and it is one of the most important concepts,
I am saying it again. So let’s have a look.
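The loop of initializing a and b with zeros, computing the cost, and keeping an update only when the cost goes down can be sketched as follows. The stepping rule here (try plus or minus a step on each value, halve the step when nothing helps) is an assumption chosen to keep the sketch short; it is not the rule the model actually uses to pick a direction, which is covered later in the course. The data is invented and lies on the line y = 0.2 + 0.3x.

```python
def sse_cost(a, b, xs, ys):
    """Sum of squared errors of the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def fit_by_trial_steps(xs, ys, step=1.0, min_step=1e-6):
    """Start from a = b = 0 and keep only the updates that lower the cost."""
    a, b = 0.0, 0.0
    cost = sse_cost(a, b, xs, ys)
    history = [cost]  # cost after every accepted update
    while step > min_step:
        improved = False
        for da, db in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            trial = sse_cost(a + da, b + db, xs, ys)
            if trial < cost:
                # The cost went down, so the update is kept.
                a, b, cost = a + da, b + db, trial
                history.append(cost)
                improved = True
        if not improved:
            # No direction helped at this step size, so try smaller steps.
            step /= 2.0
    return a, b, history


# Hypothetical data generated from y = 0.2 + 0.3x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.5, 0.8, 1.1, 1.4]

a, b, history = fit_by_trial_steps(xs, ys)
print("intercept:", round(a, 3), "slope:", round(b, 3))
print("accepted updates:", len(history) - 1)
```

Every entry in `history` is lower than the one before it, which is the cost and optimization working hand in hand.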
We are first showcasing how you can set up
a linear equation in the model. Suppose
the equation we need to design in the model
is y equal to b plus W x.
We are initializing W
and b as Variable types, with values 0.3 for W and -0.9 for b, and then x
is a placeholder, so as we said, we will be
providing the value of x later on. We have defined
the linear model as W x plus b. This is the line of code, the global variables
initializer, that initializes the Variable type objects in the model. I initialize
my session as tf.Session(), and then I run init so that the values of the Variable
types are initialized. Then I run the linear model
with sess.run,
passing it a feed dictionary. Here I’m not writing feed_dict explicitly;
as I mentioned earlier, your code
automatically understands that all the values between the curly brackets
are of dictionary type. What your model
is doing now is taking 1, multiplying it by W, which is 0.3,
and adding -0.9, which is the b value, and this is done
for all the values of x.
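To make the placeholder, Variable and feed dictionary idea concrete without needing TensorFlow installed, here is a toy imitation in plain Python. `Variable`, `Placeholder`, `Op`, `as_node` and `run` are hypothetical helpers written for this sketch, not the TensorFlow API, and the x values 1 to 4 are assumed. W = 0.3 and b = -0.9 are the values from the linear model above, and the second part repeats the earlier adder example.

```python
class Node:
    """Base class: + and * build new graph nodes instead of computing values."""

    def __add__(self, other):
        return Op(lambda x, y: x + y, self, as_node(other))

    def __mul__(self, other):
        return Op(lambda x, y: x * y, self, as_node(other))


class Variable(Node):
    """Holds a value that a training procedure could later change."""

    def __init__(self, value):
        self.value = value

    def evaluate(self, feed):
        return self.value


class Placeholder(Node):
    """Has no value of its own; it is looked up in the feed dictionary."""

    def evaluate(self, feed):
        return feed[self]


class Op(Node):
    """An operation on two other nodes."""

    def __init__(self, fn, left, right):
        self.fn, self.left, self.right = fn, left, right

    def evaluate(self, feed):
        # Evaluating the last node pulls in everything it depends on.
        return self.fn(self.left.evaluate(feed), self.right.evaluate(feed))


def as_node(value):
    """Wrap plain numbers so they can take part in the graph."""
    return value if isinstance(value, Node) else Variable(value)


def run(node, feed_dict):
    """Evaluate `node` once per position of the fed lists."""
    length = len(next(iter(feed_dict.values())))
    return [node.evaluate({key: values[i] for key, values in feed_dict.items()})
            for i in range(length)]


# Linear model: W and b are variables, x is fed in later.
W, b, x = Variable(0.3), Variable(-0.9), Placeholder()
linear_model = W * x + b
outputs = run(linear_model, {x: [1, 2, 3, 4]})

# Adder example: two placeholders, values supplied through the dictionary.
a_in, b_in = Placeholder(), Placeholder()
adder = a_in + b_in
sums = run(adder, {a_in: [1, 7, 6], b_in: [3, 20, 2]})
times_five = run(adder * 5, {a_in: [1, 7, 6], b_in: [3, 20, 2]})
```

Feeding longer lists needs no change to the model definition, which is the benefit of placeholders described earlier.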
So this is a very crude way of setting up the model. If you see, we are not doing
any weight optimization here:
whatever values we initialized
remain the same, and whatever output
came we have taken as it is. For addition we can use tf.add, for subtraction
tf.subtract, and for multiplication
tf.multiply. One more important thing: for matrix multiplication we will
be calling a function called matmul, tf.matmul, which stands for
matrix multiplication. It’s
not important right now, but I’ve put this
in to explain that sometimes your data
is not defined in the type you want. tf.cast is a function which you can use
to convert your data. If you see, this is a float type because I’m taking 2.0, and if I need to convert it
to the integer type I can use this cast function. I think
cast is available in SQL as well. So tf.cast can convert it
to the integer type. This one is for division. So we understood some
of the basic building blocks
of TensorFlow, and these
were constant, placeholder and Variable. Now, activation
functions are a kind of mapping function between input and output. Here the input and output are
not the input features and output labels;
I mean that
if you provide a number
to a function, it does
some mathematical computation on it and returns some value. They are used in numerous
ways and numerous places, and some of these, especially sigmoid, tanh, ReLU
and softmax, the bottom four, you will
be using extensively throughout this course. These activation
functions are nothing to be scared of. They are
just mathematical functions which transform a
value into some other value. That’s why they
are also called transformation functions. The first one is
called linear, or identity. Basically
you can think of it as no function
being applied, though you will hear that a linear activation function
is used. What it does is return whatever value you
give to the function. If you give it a value
of 0 it returns 0, if you give 2 it returns 2, if you give n it returns n. You can think of this
line as: whatever value you give on x, the same value
of y is returned. That is why it is also called the identity
function. It is used
in the output layer of a neural network where we need to predict some kind of numerical
value, say in a regression kind of problem, where we don’t have
to give the prediction in terms of probabilities.
We just need to predict
some real numbers. Then we can use this kind
of activation function in the output layer, or you can also think of it as
not applying any activation function,
as simple as that. But in some texts
it is called a linear activation function, so it’s good
to know. The second type is called a threshold function,
threshold unit function, or unit step.
It’s
a kind of threshold function: it checks whether the value of x is
beyond some value. Here we are given the example
that it returns 0 if x is less than 0 and 1 if x is greater than 0, so that’s the threshold
we have put in. But if you look
at this picture here, what we are
depicting is that if the value of x, the number
you are inputting into the function, is less than or equal to 5, then
it returns zero, and if the value is greater
than 5 it returns the value 1. So it’s a kind of unit function, and it’s used
in places where you need to set up
some kind of threshold: if the value is greater
than this, return me this value, and if the value is less
than that, return me that particular value. But it
is not used extensively throughout the course. Now comes the hero
of activation functions, and that’s called the sigmoid
or logistic activation function. When you give some number
to this function, it maps the output to a value
between 0 and 1. So if you give any value
between minus infinity and plus infinity, it converts it
to a value between 0 and 1, and that is how it is helpful in predicting
the probability of an event. If the value we
are inputting is positive and large, the output from this function
is closer to 1; if we are inputting a negative value, the output
is a smaller value, closer to 0. For example, if x is equal
to 1 then the output is about 0.73. If it’s 10, it’s very close to 1.
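A minimal sketch of the sigmoid, assuming the usual formula 1 / (1 + e^(-x)) and using only Python's math module. The separate branch for negative inputs is an implementation choice to avoid overflow for large negative x, not something from the slides.

```python
import math


def sigmoid(x):
    """Map any real number to a value between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # Equivalent form that never exponentiates a large positive number.
    z = math.exp(x)
    return z / (1.0 + z)


for value in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(value, "->", round(sigmoid(value), 4))
```

Large positive inputs land near 1, large negative inputs land near 0, and an input of 0 gives exactly 0.5.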
You can think of it as just putting any value
into the function, and it maps
the value to between zero and one. The fourth function is
the hyperbolic tangent, or tanh. For a given input value
it transforms it to a value between minus one and plus one, much
as the logistic function converts values to between 0 and 1. It’s very similar in nature to
the sigmoid activation function. The only difference is that instead
of converting the values to between 0 and 1 it converts
them to between minus 1 and plus 1, and for an input
of 0 the output is zero. For an input of
0 the sigmoid, however, returns 0.5, and you can
easily understand why: if we input 0, then e
to the power 0 is 1, and 1 divided by 1 plus 1 becomes 0.5. So this is the point
when the input is 0. As far as implementation is
concerned, sigmoid is used in the output layer so that we can predict the probability
of an event. Tanh is used, though we
haven’t discussed the architecture yet, in a lot
of the mathematical computations in your neural network that derive
the best values of the weights. Another activation function
is called ReLU, and this has actually transformed
the entire learning process of neural networks. The name stands for rectified
linear unit. It looks very simple: for all
input values less than or equal to 0 it returns zero, but for all positive values
it returns the same value, so it is linear for all
the positive values.
It looks very simple, but let me tell you, it
has changed the entire paradigm of neural networks. This
is the activation function used for deriving
the most intelligent weights in a neural network. The softmax
activation function is very similar to sigmoid
in that it predicts probabilities, but it works very well when the number
of classes is more than two. So if you’re working
with more than two classes, it converts the outputs into probability terms. I
can see there is a major typo on the slide: it should be 0.01,
0.01, 0.01, because the values should sum
to a total of 1. What it does is give the probability
of each class for an event, and these should
sum to one. Let’s say there are four
categories, A, B, C and D, like here. It gives the probability
of the event happening in class A, let’s say 0.5 here, then 0.3, 0.1 and 0.1, and they
sum to 1. Softmax is also
used in binary cases, but its major use is
when you have more than two classes. The output
is very similar to what sigmoid gives, but it’s
for more than two classes. So this is a brief introduction
on activation functions. Now we are importing
the numpy library as np. You can define your own functions
this particular way. You can define
your own custom functions if you need to do your own
calculations. def, for define, is the keyword
that tells Python that you are writing
your own function. If I just talk
about the syntax, we
are giving the name of our function,
then we are saying that we need to provide an input,
and then, given the input, what
it should return. return is the value it
gives back when you run it. It should give 1 divided by 1 plus
np.exp, which means we are calling the exponential
from the numpy library, of minus a. So whatever value
you give here is put into this particular formula,
and the result is returned. I’m doing it for
several values of a.
Let me go through
what it’s doing. First, it’s running the activation
function for 1. So if the value of a is 1, this is the output of your sigmoid
activation function. If the value of a is 2,
it becomes even bigger. As I was saying, when the number you are giving to
a sigmoid activation function is positive and larger,
the output is closer to 1. If you see, I
have given a value of 4 and it’s very close to 1, about 0.98. When you give 5 it’s
even closer. However, if you give a negative
input to this function, the output is very close to 0, and in this way
your model is able to make predictions between 0 and 1. If I give
the value of zero, it returns 0.5. These activation
functions are also available in TensorFlow, where you can just call them rather than
defining the function yourself. Here I have written
the tanh activation function, and here ReLU. ReLU is very simple: all you have to do is return
the maximum of 0 and the number. So
if a is greater than 0 it returns a, and
if a is less than zero it returns zero. You see how easy
the ReLU implementation is. Once you run it,
having written the relu function, all I’m doing is
asking for the value: if I input a equal to -4, what is
the ReLU of a? For -4 it returns zero, for -2 it returns zero, for positive 2 it returns
positive 2, and for positive 4 it returns positive 4. In this fashion you can write all your functions,
and I have done the same for softmax and sigmoid.
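For reference, the remaining three functions can be sketched the same way. This version uses Python's math module instead of numpy so it runs with nothing installed, which is an assumption of this sketch rather than the code shown on screen; the input scores passed to softmax are made up.

```python
import math


def relu(a):
    """Return the maximum of 0 and the input."""
    return max(0.0, a)


def tanh(a):
    """Map any real number to a value between -1 and +1."""
    return math.tanh(a)


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum keeps exp() from overflowing
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


relu_outputs = [relu(a) for a in (-4.0, -2.0, 2.0, 4.0)]
class_probs = softmax([2.0, 1.0, 0.1, 0.1])  # hypothetical scores for classes A to D

print("relu:", relu_outputs)
print("tanh(0):", tanh(0.0))
print("softmax:", [round(p, 3) for p in class_probs])
```

The softmax output has one probability per class, and the largest score receives the largest probability.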
Rather than defining these functions yourself, you can call tf.sigmoid,
tf.tanh and tf.nn.relu, where nn stands for neural
network in TensorFlow. For ReLU you need to call it this way, through nn,
and it gives the same output as the above
formula. In TensorFlow, or in any deep learning architecture, it is always a good idea
to initialize your weights with some random numbers, and there is a function in TensorFlow called
truncated_normal. What truncated normal does is
generate normally distributed
numbers with mean 0 and standard deviation 1. The word truncated
comes from the fact that all the values generated
from this normal distribution lie within plus or
minus two standard deviations and not beyond. Truncated
normal is used so that you don’t get extreme
values from the distribution: all the values you generate from
this particular distribution are within plus or minus two standard
deviations of the 0 mean.
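The truncation can be imitated with rejection sampling: draw from a normal distribution and redraw anything farther than two standard deviations from the mean. The helper below is a hypothetical plain-Python sketch of that behaviour, not TensorFlow's own implementation, and the seed and sample count are arbitrary.

```python
import random


def truncated_normal(count, mean=0.0, stddev=1.0, rng=None):
    """Draw `count` normal samples, redrawing any beyond 2 standard deviations."""
    rng = rng or random.Random()
    samples = []
    while len(samples) < count:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2.0 * stddev:
            samples.append(value)  # within range, so it is kept
    return samples


# Hypothetical use: 1000 initial weights with a fixed seed for repeatability.
weights = truncated_normal(1000, rng=random.Random(0))
print("smallest:", min(weights), "largest:", max(weights))
```

No sample falls outside the range from -2 to +2, so the initial weights contain no extreme values.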
How did this architecture, or mathematical architecture, get the name neural network
in the first place? The intuition for this kind
of architecture of a learning algorithm comes from
biological neurons. People who come from a biology background
will already be familiar with this kind of picture, but for others who are not very familiar
with either deep learning or biological neurons, this is how one looks. There are
three major components which have been used in building
deep learning architectures.
The first one is called dendrites, and dendrites are nothing
but signal receptors. For example, there are billions
of biological neurons in the human brain, and a lot of these biological neurons are
attached to our sensory systems. For example, there
are a lot of neurons attached to our eyes, and when we see something
we receive signals in terms of light.
What these dendrites do is
receive the signal
coming from the eye, because they’re connected
to the retina or whichever part of the eye is processing
the information. Once these dendrites have
the information, it is summed up in the nucleus of the neuron,
which is called the cell body. So the intuition is that
information signals are received from these different dendrites,
and then the neuron
sums up the total information
coming in. The final or
third component of a biological neuron
is called an axon. It passes the total information which has been received
through the axon on to the next neuron connected to it. One neuron
is connected to multiple neurons,
and there are a lot of neurons put together. When the information received
from all these signals,
coming from
their respective cell bodies, reaches a particular threshold,
your brain takes some decision. This information
is processed in the form of electric signals
or electric current. This is the same architecture
that McCulloch and Pitts modeled mathematically in the 1940s, and on which Rosenblatt in the 1950s built the first
architecture called the perceptron, which mimics this process of receiving
signals, summing them up and passing the result
to the next level. And from there only we built
the complex architectures in deep learning. If you look
at this architecture, it is very similar
to a biological neuron in that we have some inputs, x1,
x2, x3 up to xn, and w1, w2, w3 up to wn, which
are the respective weights. What
these arrows are signalling is that we multiply
each input by its respective weight, so x1 by w1, x2 by w2, x3 by w3, and so on. Here,
in what corresponds to the cell body, we sum
up the information. So this is what,
in the previous picture, we were calling
the cell body. Here we are calculating
the total information coming from our inputs
with their respective weights. Activation functions are nothing
but mapping functions, and what they do here is check whether the information coming
from the inputs with their respective weights is beyond
or equal to a threshold. Let’s say you are applying
a threshold activation function; then your model
outputs 1 or 0. For example, let’s say
the weighted sum is positive and comes to 10,
and you have put in
a threshold activation function like this: if the sum is greater than or equal to 10,
then pass 1, else 0. Let’s say this is
the function you have applied. All your activation function does is
look at how large this sum is and accordingly
pass on the value. Given the
threshold function which I have written here, is it greater than or equal to 10? Yes. So the output is
1.
produces an output. So here we are just showcasing that this summation is
nothing but the inputs multiplied by the respective weights; here it’s shown as the summation of w_i multiplied by x_i. Before I go further, here is how the bias term is handled. We introduce a column, or a feature, in our model, let’s say x0, and all the values in x0 are ones. Then we also introduce a weight called w0, which is nothing but what used to be the bias term. How do we do the actual implementation? Here it’s showing the sum of w_i x_i with i starting from 1. If you just change this to i equal to 0, then you can easily do the multiplication, because x0 is already 1 and you’re just multiplying it by the bias term. In the actual implementation you will see that we sum from i equal to 0, so we are adding a bias term and, for all the inputs in the model, multiplying by the respective weights. That’s how we can make it simpler from the implementation point of view, and we will be calling w0 the bias term. So here we have just introduced another column, x0, in which all the values are one. If I have a column or feature x0 whose value is 1 in every cell for all the training examples, and I multiply it by w0, which is the bias term, you can easily understand that I’m just getting the bias term back. So, given this understanding, how does a perceptron
learn? Let’s come to an example of a perceptron and discuss it intuitively. This is the actual data, where we have three pictures of dogs and three pictures of horses, and what we want is to create a linear classifier which can separate these two classes. What your model would be doing is, as I was saying, that initially these values of w0 to wn would be initialized randomly, and a line would be created based on these random weights. Then you see what the predictions are: how many images have been predicted accurately and how many have been predicted inaccurately, and that inaccuracy is called the error term. Let’s say you have fit this first line from the random weights, and you see this horse has been predicted as a dog and this dog has been predicted as a horse. It means there are two misclassifications, so the number of errors is two. Then, in the next iteration, your model would be learning in which direction your coefficients or weights should be changed so that this error goes down, and the next line being fit is this one. You have changed the coefficient values, and this line gets transformed into this line. Then you see again how many errors are happening, and this time there is only one error, because this dog has been classified as a horse. In the next iteration you improve the weights again so that the error goes down, and then your error has been reduced to zero, and this is the line which your model should be fitting. We have seen that with logistic regression we also fit a line, in the first module. In that learning process we didn’t go through how logistic regression learns the line; we just called a function from scikit-learn, and this function fit the line for us. But here we will be learning how your algorithm learns that these are the optimal weights, and we will be going through the different building blocks of code which help your model understand that this is the best line, the one which gives the least number of errors. Let’s take a
very simple example. All of us should be familiar with AND and OR gates, and you can think of these two gates as a kind of classification problem. If I look at the OR gate, you will see that if any one of the inputs is open the output is 1, and if both are closed the output is 0. You can also read it this way: if x1 and x2 are both equal to 0 the output is 0, and if either x1 or x2 is equal to 1 the output is 1. It is the same for the AND gate, where the output would be 1 only if x1 equals 1 and x2 equals 1, and in the remaining three scenarios the output is 0. Let’s see how we can use a simple perceptron to solve this problem. Here we have inputs and outputs, and what we want to see is the process by which an algorithm understands what the line of separation between two classes should be. We can think of it as two classes, y equal to 0 and y equal to 1, and we can present it this way. For this particular point, x1 equals 0, x2 equals 0 and the output is 0, so we show it as a non-filled circle. Then there are the other three: in these two points either x1 or x2 is 0, and at this point which I am hovering over now x1 equals 1 and x2 also equals 1. My objective is to find the line which can separate these two classes, so let’s see how your model can fit this line. We know what the inputs are, the x1 and x2 values. Let’s say the model has already learnt that the weight associated with x1 is w1, which is equal to 1, and w2, the weight associated with x2, is also equal to 1, and let’s say your bias term b is equal to 0, just for example purposes. If you look at the first example, how do we get the output? We take the summation of all the inputs multiplied by the respective weights. x1 multiplied by w1: the x1 value is 0 and the w1 value is 1, so this term becomes 0. Plus the second term, x2 multiplied by w2: because x2 is 0, this term again becomes 0, and the total output for the first example is 0. Now, we have put in a threshold activation function set up in such a way that if the output coming from this is greater than 0.5 it gives the output 1, else 0. Because the output for the first example is 0, and 0 is less than 0.5, which is the threshold, your model would be giving the output of zero.
It is the same process for the second example: x1 multiplied by w1 would be 0 because x1 is 0, plus the second term, x2 multiplied by w2, where x2 is 1 and w2 is 1, so the output is 1. Then I again pass this through the threshold activation function with the threshold of 0.5, and since 1 is greater than 0.5 your model would be outputting 1. So that’s how, for each one of the examples you have, the model would be multiplying the inputs by the respective weights, and with the help of the threshold activation function it decides what output should be given. Your model is then able to classify in which cases it should be outputting 1 and in which cases it should be outputting 0. Let’s look at what the process really looks like when we are trying to draw different lines. If this is the case, where there are some blue dots and some green dots, an infinite number of lines can pass between them. We are showing three, but any number of lines can pass through, and the model’s objective should be to choose the line which is the best fit, meaning the one which gives the least errors in terms of prediction. So how does it learn?
There are a few steps involved. We have some inputs, and I’m talking about a scenario where we need to produce some kind of classes; let’s say you need to predict whether something will happen or not. So you have some inputs in terms of your features, x1, x2, up to n features, and you have your dependent variable or class variable y. So how do you do that? First, you initialize the weights and choose the threshold or any other activation function. You can say that I am setting all my weights equal to 0, or you can say that I am randomly generating numbers for my weights with zero mean and a standard deviation of one. Second, given these randomized weights, the inputs in terms of x’s and also the output feature, you make predictions by multiplying the features by the respective weights and applying the activation function, and then you compare how well your model is doing: if the actual value was 1, was your model predicting 1 or not? This is the misclassification error you calculate. Third comes the update. This equation will look very confusing and intimidating in the beginning, but it is not very difficult. We have calculated the error, and the error is the difference between the actual and predicted values. Then we update our weights in such a way that this error gets minimized: this equation is used to update all of the weights in the direction that reduces the error term at each iteration. What really happens is that the weight at the next iteration (j is just, you know, the index for w1, w2, w3) would be the weight which was there earlier, plus eta, which is called the learning rate, times the error term, which is the desired minus the actual output, multiplied by the input. Basically, your weights would be updating in the direction that minimizes the total error from the model. This process between steps 2 and 3 would be repeated until the error has stagnated, or you can repeat it for a fixed number of iterations; you can say, okay, repeat this learning process for 400 iterations. This is the overall process. And actually this
does not change for any of the learning algorithms. Whether you’re talking about a machine learning algorithm, a deep learning algorithm, or unsupervised learning, the process remains essentially the same: you initialize some weights, you have your inputs, you measure how well your model is doing in terms of errors, and with the help of the error you keep on improving these weights so that the error is minimized. This process, as I said, is common across your machine learning and deep learning algorithms. So this, I think, is the process which all algorithms follow
in terms of learning. Let me give you an introduction to two new terms. The first is called the learning rate. What is the learning rate? The learning rate is the size of the step by which your weights should be changing, and let me give you an intuitive understanding of what it really means. Any weight W in the next iteration would be changed according to this equation, eta times this term, which is some value you will be finding from some process. Let’s say this term is some constant, 10. If I change the value of eta, the speed at which my w1 would be changing at each iteration would be different. What I mean is, if I keep eta at 0.1, then in the next iteration my weight would be adjusted by 0.1 multiplied by 10, which is equal to 1. So in the next iteration my weight would be whatever weight it was, plus this term, which is equal to 1. However, eta is again a value which you would be providing to the model, and if I fix eta equal to 0.01, then, because you can think of the other term as the constant 10, the multiplication is 0.1, and in the next iteration your w1 has been changed only by 0.1. So eta is the step size, the step your weight would be taking when it changes. If you keep eta high, your weights would be changing more aggressively; if you keep it low, your weights would be taking smaller steps in updating, though both are going in the right direction. When we keep eta at 0.1 your weight is changing by 1 at every step; if I keep it very small, your weights are updating in smaller steps. So eta is used to control how fast or slow your model would be learning. Another thing is epochs. Till now we have been
discussing this word, iteration. When you say iteration and that I’m running it a thousand times, basically I am saying that I’m running a thousand iterations of it. The weight adjustment will be happening through learning from all the training examples. Let’s say you have a hundred training examples in your data; the errors will be calculated from your initialized weights on all these hundred examples. Like we saw in logistic regression, where we looked at the accuracy percentage, the accuracy percentage is again one of the ways of seeing how well your model is doing. So this learning process is taking all these hundred examples and going through all of them to change your weights in the right direction. The number of epochs means how many times you go through all these hundred examples in the learning process. You will be deciding for how many iterations you need to run your optimization algorithm. Let’s say you want to run it 1,000 times: your model would be going through the training examples 1,000 times and changing the weights, and after 1,000 iterations you will see what the final weights are. Each such pass over the training examples is called an epoch. So this is the IPython notebook
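Separately from the notebook used in the session, the learning loop described above (initialize the weights, predict, measure the error, update each weight by eta times error times input, repeat for a number of epochs) can be sketched in plain Python. This is a minimal sketch, not the session’s code: the OR-gate data, the zero initial weights, `eta = 0.1`, the 20-epoch cap and the helper names are all assumptions made for illustration. The bias is handled as w0 with a constant input x0 = 1.

```python
def predict(weights, row):
    """Threshold activation on the weighted sum; row[0] is the constant x0 = 1."""
    total = sum(w * x for w, x in zip(weights, row))
    return 1 if total > 0 else 0


def train_perceptron(rows, targets, eta=0.1, epochs=20):
    weights = [0.0] * len(rows[0])          # step 1: initialize the weights
    for epoch in range(epochs):             # one epoch = one pass over all examples
        errors = 0
        for row, target in zip(rows, targets):
            error = target - predict(weights, row)   # step 2: desired - actual
            if error != 0:
                errors += 1
                # step 3: w_j <- w_j + eta * error * x_j
                weights = [w + eta * error * x for w, x in zip(weights, row)]
        if errors == 0:                     # stop once nothing is misclassified
            break
    return weights


# OR gate, with x0 = 1 prepended to every example so that w0 acts as the bias.
rows = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
targets = [0, 1, 1, 1]

weights = train_perceptron(rows, targets)
predictions = [predict(weights, row) for row in rows]
print("weights:", weights)
print("predictions:", predictions)
```

Changing `eta` changes the size of each update: with an error-times-input term of 10, an eta of 0.1 moves the weight by 1 and an eta of 0.01 moves it by 0.1, as in the discussion above.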
and let me zoom in a bit if I can. The first thing which we need to learn here is called the loss function. The loss function, if I take the intuitive path, is an indication of how well your model is doing. If there is a lot of loss, it means the model is not doing well. If it’s a hundred percent accurate, there is no difference between the actual value and the predicted value. This is the definition which comes from the TensorFlow website: a loss function measures how far apart the current model is from the provided data. There are a few metrics which are used to measure how much the loss is, and the choice depends on the kind of problem you’re going to solve. For example, if you are working on a regression problem, where you are outputting some kind of numerical value, MSE, the mean squared error, is the loss function which is widely used. You can also use MSE for classification problems, where you take 1 or 0 as your actual value and a probability as the prediction, let’s say from a sigmoid, which would be between zero and one, and you take the difference between the actual value and the predicted value. It is possible and you can use it, but there is another loss function called cross entropy. Entropy is a measure of uncertainty, or loss of information, if you talk in signal terms, and cross entropy is a loss function which is used extensively for most classification problems. So let’s take
a toy example here. We are trying to see what the loss is. You can see that we are importing tensorflow as tf. We are initializing two variables, and for now we are just fixing their values at 0.3 and -0.3 for W and our bias term b. We are initializing x and y, and then we are writing an equation in which the output is the weight multiplied by the input feature plus the bias term. The squared delta which I’m calculating here is the difference between the actual value, which is y, and the predicted value, which is the linear model, and tf.square is helping me take the square of each difference. Why are we squaring? There are two reasons. One, we will see at the time of optimization that we take some derivatives, and taking the derivative of a squared term is pretty simple in comparison to an absolute value. Second, the difference between the actual value and the predicted value may be negative for some examples and positive for others, and if I sum this difference over all the examples it may come to 0. You may then get the impression that your model is doing a great job, which it is not. So the loss you are taking is the squared sum: here we have taken the square of all the individual values, and reduce_sum is actually taking the summation over all the examples.
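The same computation can be reproduced without TensorFlow as a quick check. The sketch below uses plain Python; W = 0.3 and b = -0.3 come from the explanation above, while the sample inputs `x = [1, 2, 3, 4]` and targets `y = [0, -1, -2, -3]` are assumed here for illustration.

```python
W, b = 0.3, -0.3                # fixed weight and bias from the example
x = [1, 2, 3, 4]                # sample inputs (assumed for illustration)
y = [0, -1, -2, -3]             # sample targets (assumed for illustration)

linear_model = [W * xi + b for xi in x]                   # predictions
deltas = [pred - yi for pred, yi in zip(linear_model, y)]
squared_deltas = [d ** 2 for d in deltas]                 # elementwise square, like tf.square
loss = sum(squared_deltas)                                # total over examples, like tf.reduce_sum

print("sum of squared differences:", loss)

# Why square? Signed errors can cancel out and hide a bad fit.
cancelling = [2.0, -2.0]
print(sum(cancelling), sum(d ** 2 for d in cancelling))
```

With these assumed values the squared loss comes to about 23.66, while the two cancelling errors sum to zero even though neither prediction is right.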
This line, if you recall, was for initializing all the variables in the model: the global variables initializer helps me initialize the values of W and b. Then you run a session, and you need to provide the input values with the help of the feed dictionary. I’m saying x is equal to 1, 2, 3, 4 and y is equal to 0, -1, -2, -3, and then you run. So what is your model doing? It is producing the output with the help of the input, which is x, and the weight and bias term which I initialized at the top, 0.3 and -0.3. It’s multiplying the weight by all these values, then adding the bias term, and then we are taking the squared difference for all the four examples we had, and this is the total loss we are getting from this model. So the loss function is an indication of how well or badly your model is doing. Now comes the process of optimization, and this is the core of learning. One way to calculate the difference between actual and predicted, if the variable you’re trying to predict is a numerical variable, as in the case of regression, is to take the help of the squared difference, and it will be useful in measuring how well a model is doing. But in the case when we have categorical values, we use a different loss function, and most of the time we use cross entropy, so let’s take cross entropy first. I have created a very simple and very small example
just to give you an intuition of how your model learns how much the cost is using cross entropy. So I’m taking an example. Let’s say you have two features, x1 and x2, though here I’m just calling them x and x1, and there are three classes in your output variable, class 1, 2 and 3, and you have data for seven customers. Because I know there are three distinct classes in my dependent variable, I can represent it with one-hot encoding, something like this: I can represent class 1 as 1 0 0, class 2 as 0 1 0 and class 3 as 0 0 1, and this would be repeated, so class 1 is again 1 0 0, and so on and so forth for all the examples we have. Now, for this section, think of it as the output we have got. If we have two features or two variables, x and x1, then there would be two weights associated with them, and we will be producing some kind of output. Let’s say here the weight was w and here the weight was w1, so these are the two weights, one for each variable, and whatever comes from the multiplication, w multiplied by x and w1 multiplied by x1, is the output you are getting. Now I have got these outputs, and the actual values are represented with one-hot encoding. At the time when I was explaining the activation functions, I mentioned that softmax is an activation function which converts the output coming for each class into a probability, in the same way as we have the sigmoid activation function. For example, take this row which is highlighted here: for customer one these are the total scores that came in for the three classes. The softmax function takes the exponential of each value and divides it by the summation of all the exponential values. So what I have done is take the exponential of all these individual values, the exponential of 3, the exponential of 5 and the exponential of 2, and then converted them into probabilities: so this is 20 divided by the sum of all three values. Now, if you see, the summation should be 1 across all three classes, and basically what your softmax tells you is that wherever the highest probability is, your model is predicting that particular class. In this example which I’m highlighting here, the actual value was class 1. However, our model is predicting the highest probability for class 2, so the model output would actually be class 2. But our actual purpose in discussing this particular
example was the loss function, and the function for this is very simple. It is the summation of y multiplied by the log of p, where p is the probability for that particular class. If you look at this function closely, we only need to do the multiplication for the class where y equals 1; for the rest it automatically becomes 0, because y is 0. Let me take a step back. We were representing the first example as class 1, so these three values are 1 0 0. If you look here, we get a value only for the first class, which comes to 0.94, and for the rest it is 0, given that class 2 and class 3 are already represented by 0. So cross entropy is also a very efficient mathematical operation for identifying the cost, since it only measures where y is equal to 1. In intuitive terms, what we are asking is: the model was expected to give 1 and it predicted 0.11, so what is the total cost? The way the function works is that if the output for the particular class is close to 1 then, since you know that the log of 1 is 0, the loss for that particular example would be 0. However, here we expected 1, which is the actual value, and your model is giving 0.11, so this is the loss. The function in the end is nothing but, with a negative sign in the beginning, p3 multiplied by the log of q3, where q3 is the probability coming from the softmax function. For the second example, which was class 2, the probability for class 2 is 0.24, and this is the loss we are getting. Let me look at one of those where the loss is very close to 0. If you look at this one, it was a class 2 example and the probability is very close to 1, almost 1, given that all the other values are negligible, and the loss becomes 0. So what cross entropy does is this: if your model is predicting close to 1 for the correct class of a particular example, the loss is very small, but if you predict a lower probability the loss is higher. At the end of the day, all you do to calculate the loss of the entire training data set is sum up all the values from these cross-entropy terms.
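The softmax and cross-entropy steps for the first customer can be checked with a few lines of plain Python. The scores 3, 5 and 2 and the true class (class 1) are taken from the example above; the function names are made up for illustration. The sketch prints the loss with both the natural log (the usual convention in libraries) and log base 10, since a value of about 0.94 for this row corresponds to base 10; that reading of the spreadsheet is an inference, not something stated in the session.

```python
import math


def softmax(scores):
    """Exponentiate each score and divide by the sum of the exponentials."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot, probs, log=math.log):
    """-sum(y * log(p)); only the class with y = 1 contributes."""
    return -sum(y * log(p) for y, p in zip(one_hot, probs) if y)


scores = [3, 5, 2]        # scores for customer one
actual = [1, 0, 0]        # one-hot encoding of class 1

probs = softmax(scores)
predicted_class = probs.index(max(probs)) + 1

print("probabilities:", [round(p, 2) for p in probs])
print("predicted class:", predicted_class)
print("loss (natural log):", round(cross_entropy(actual, probs), 2))
print("loss (log base 10):", round(cross_entropy(actual, probs, log=math.log10), 2))
```

The probability for class 1 comes out near 0.11 and the model picks class 2, matching the walkthrough. The loss for the whole data set would be the sum of this quantity over all seven customers.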
There are different other ways of calculating the cost or loss function. However, MSE for numerical outputs and cross entropy for classification outputs are most of the time used as the loss function. So the optimization process is that you update the weights which are associated with the input values in a way that your loss keeps getting reduced. You would have seen this kind of explanation from people like Andrew Ng and Geoffrey Hinton, and if you’re reading some machine learning book, they have also taken a similar intuition. What we have created here is only an illustration, but think of it as an example where somebody is on top of a hill and he wants to come down to the ground, and he’s blindfolded. The only thing he has is a stick, like blind people have, and the only objective he has is to come down as quickly as possible. So how can he decide which is the right way of going downwards? Let’s say there are four ways he can go down: he can take this path, he can take this path, he can take this path, and there are other ways which would be on the other side of the mountain. What this person can do is tap around with his stick and see whether he’s going downwards or whether it’s flat in a particular direction. We have intentionally kept the slopes different in each direction: this one is steeper in comparison to this slope which I’m hovering over. So if he taps around in four directions he can see which is the steepest one, and if he takes the steeper path he will be able to reach the ground more quickly. The same intuition is taken by your learning algorithm. This person wants to reach the ground level as soon as possible; your model’s objective is also to reduce the loss or cost term as quickly as possible, and if you improve the weights in the best possible direction, you will be able to reduce the loss more quickly. Taking this intuition further, let’s say he taps around and sees that this is the direction which will take him downwards very quickly. Then another question is how far he can go: whether he takes a two-meter step, a one-meter step, or just a half-meter step. That is controlled, as we saw, by the learning rate: how big a step he’ll be taking is decided by the learning rate in the algorithm. Let’s say he can take this bigger step in one go, and he has seen that this is the steeper direction, so in one step he reaches here. Then he taps around again, sees which is the steeper direction, and takes one more step going downwards. Then he taps around, sees which direction he needs to move in, and takes another step. When I say he taps around and sees which direction to move, in the algorithm that is decided by what is called the gradient. The gradient is the first-order derivative, which helps you understand in which direction your function should be moving to minimize the cost. At each step he keeps on doing the tapping around, and then he reaches the bottom.
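The walk down the hill maps onto a few lines of code: the derivative of the loss plays the role of tapping around to find the downhill direction, and the learning rate is the step length. The sketch below is plain Python on a made-up one-weight problem (data where y = x, so the loss is lowest at w = 1); the data, the learning rate of 0.01, the step count and the function names are assumptions for illustration, not the session’s code.

```python
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 2.0, 3.0, 4.0]        # made-up data: the best weight is w = 1


def loss(w):
    """Sum of squared differences between prediction w * x and target y."""
    return sum((w * xi - yi) ** 2 for xi, yi in zip(x, y))


def gradient(w):
    """First derivative of the loss with respect to w."""
    return sum(2 * (w * xi - yi) * xi for xi, yi in zip(x, y))


def gradient_descent(w, learning_rate=0.01, steps=50):
    for _ in range(steps):
        # Step against the gradient; the learning rate sets the step size.
        w = w - learning_rate * gradient(w)
    return w


w_from_left = gradient_descent(0.1)    # start below the minimum
w_from_right = gradient_descent(2.0)   # start above the minimum
print(w_from_left, loss(w_from_left))
print(w_from_right, loss(w_from_right))
```

Both starting points end up at essentially the same weight. For this particular problem a learning rate above roughly 0.033 would overshoot the minimum and diverge, which is why the step size has to be chosen with some care.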
The same intuition is taken by your learning algorithm, which is called gradient descent. When you say he reaches the ground, in algorithmic terms we can say that we have stopped reducing the loss function: our loss function has been reduced to its minimum and it’s not reducing any further. The point where your function gets its lowest value is called the global minimum. In TensorFlow, the tool for doing this is called the gradient descent optimizer, and there are a few variants of it. It is the most basic optimization algorithm for helping your model understand in which direction your weights should be updated. We have initially started W at 0.5 and b at 0.1, so you can think of it as having randomly assigned the value 0.5 to W and 0.1 to b. We have created two placeholders for our input and output features, and we have created a function for calculating the output for a given example. The loss function is the same as the one we used in the last explanation: we are taking the difference between the actual value and the predicted value, taking the square of it, and then calculating the total loss by taking the sum. Now, in TensorFlow, to minimize this loss by updating the weights in the right direction, I’ll be calling an optimization function called gradient descent, and I can easily call it with tf.train.GradientDescentOptimizer. So this is the way you call the gradient descent optimizer. This particular value you see here is the learning rate, that is, how big a step your model would be taking when it updates: whatever value you initially start with, how much it changes at each step is decided by this. Then you need to train your algorithm, and for that you need to call this optimization function. It could be done in one line, but I have put it in two lines; you could have put the minimize of this loss here itself. What we are doing is creating an optimizer and then calling it, and our intention is to minimize the loss function which is coming from here, the sum of squared distances between the actual and predicted values. We do it by calling this optimization function, which is gradient descent. Then we are calling the function for initializing the variables, so we are initializing 0.5 and 0.1, starting a session, and giving the values of the input and output. The most important thing you need to think about is these two lines: for i in the range of three. What we’ll be doing is running this model three times. All this kind of line does is repeat its body three times; if I just put a print inside it, all it does is print three times. So for every i we just run the train function on the training examples we have, and when that’s done, because it’s out of the loop, what we print is whatever the learnt values of W and b are. You can see that initially the value of W was 0.5 and b was 0.1, and then W has changed to -0.54 and b has changed to -0.24, and if we run it for a larger number of iterations they will be changing further. So if you see the loss
when you had the weight at 0.1, this was the loss; when you had 0.2, this was the loss; when you had this, this; you keep on seeing the loss go down, and at w equal to 1 your loss was zero. But if you increase it again to 1.1 and keep on increasing, your loss again keeps on increasing, because now your predictions are going off in the opposite direction. How your model sees it is actually something like this. Initially, that was your total loss. What your model does with gradient descent is take the derivative at this particular point and see which direction it should move. It says, okay, this is the direction you should move, so you need to increase the weight. And increase the weight by how much? One part is the gradient value, which comes from the derivative, but you are also multiplying by the learning rate. Let’s say you have fixed the learning rate at 0.2, or whatever number you have fixed; this gives the step size you keep on taking. So from here you have reached this particular value, and at this point you again take the gradient and see which direction your weight should be moving, and your learning rate tells you how big the step would be, and then it moves here. At each step you keep on calculating the loss, or the cost. When it reaches there, your gradient becomes horizontal, your model will not be changing the weight any more, and you can say the model has converged. Why I have shown you both sides is that it is perfectly possible that, rather than initializing your weight at 0.1, you could have started with a weight equal to 2, and then your model does the same process: it measures the gradient, sees which direction it needs to move, and keeps on going in the direction where the loss is going down. Basically, in the textbooks you won’t see
both directions; most of the time you get to see only half of the picture. This is roughly what your loss function looks like when you are actually implementing a model: you start from some place, the gradient descent algorithm tells you which direction to move, and the learning rate decides the step size. You keep going in that direction until your model has identified the weights where the loss has become almost negligible, and then the line goes something like this. The process applies from either side because, as I was saying, you are supposed to initialize the weights randomly, and it is perfectly possible that rather than starting the weight at 0.1 you start from a weight of 2 or 3. In any of those situations the learning process remains exactly the same. That was the reason I wanted to show you both sides, and this is what happens in the back end. Initially we started with weights of 0.5 and 0.1, and we created a loss function as the sum of squared differences. Then we called gradient descent, and what gradient descent actually does is change the weights from the values we initially gave to our model. They are changed in the direction in which the loss goes down further and further. So this is the way your model learns. The loss function and the optimization function are the two most important building blocks, and this is one of the most important ideas in machine learning and deep learning.
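The process described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the session: the data, the single weight, the learning rate of 0.01, and the function names are all assumptions chosen so that the arithmetic is easy to follow. It runs gradient descent on a sum-of-squared-differences loss from two different starting weights, 0.1 and 2.0, and both runs settle on the same weight.

```python
def loss(w, xs, ys):
    # Sum of squared differences between predictions w * x and targets y.
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))


def gradient(w, xs, ys):
    # Derivative of the loss with respect to w.
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))


def gradient_descent(w_start, xs, ys, learning_rate=0.01, steps=50):
    w = w_start
    history = [loss(w, xs, ys)]
    for _ in range(steps):
        # Step against the gradient; the learning rate scales the step size.
        w = w - learning_rate * gradient(w, xs, ys)
        history.append(loss(w, xs, ys))
    return w, history


# Hypothetical data generated with a true weight of 1.5.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.5, 3.0, 4.5, 6.0]

w_from_left, hist_left = gradient_descent(0.1, xs, ys)
w_from_right, hist_right = gradient_descent(2.0, xs, ys)
print("start 0.1 ->", round(w_from_left, 6), "final loss", hist_left[-1])
print("start 2.0 ->", round(w_from_right, 6), "final loss", hist_right[-1])
```

Starting to the left of the minimum, the gradient is negative and the weight increases; starting to the right, the gradient is positive and the weight decreases. The update rule is the same in both cases.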
Minimum value of a function. If I look at the function I have here, I can start my values anywhere and I can move to any value. This is the type of function I get, and people who come from a mathematics background will know that this is a convex function. Because it is a squared difference between values, we always get this kind of function. You can change w in any direction and you will end up with this kind of graph only; there is no way you will get any other kind of graph. People who know quadratic equations will recognize that this kind of function has only one minimum value. You can start anywhere between minus infinity and plus infinity and keep changing the weights, and there will be only one minimum value in this kind of function. So convex functions always have just one minimum value, and given that there is only one minimum, it is called the global minimum, because it is the minimum value of the function itself. If your loss function or cost function is quadratic in nature, there can be only one such value. Once you reach it, you can run your model for thousands of iterations and the weights will not change, because the gradient has become parallel to the axis and there is no change in your weights. But actually this
is not always true, because you will be dealing with different kinds of cost functions. I have used the squared one here, but let’s say you are using a different one, for example the absolute difference, as your cost function. Then, depending on the cost function and the model, there is a possibility that you end up with this kind of slope, this kind of graph, for the cost function. When we say local minimum, take this particular point I’m hovering over now. Let’s say you have reached this place while learning, and you take the derivative, or the gradient, here: it will again be parallel to the axis. So your model may think it has reached the minimum value possible for your cost, and it stops updating your weights. All such points are called local minima. In this particular function, this one is actually the global minimum, because it is the lowest value possible for this equation. So if you are using a squared cost function, you don’t have to worry about local minima, but you should be aware that with different kinds of loss functions you may have local minima in your cost function. There are different ways of dealing with that. For example, you initialize the weights as randomly as possible, so there is less probability of getting stuck at a specific place, or you change the type of loss function you are using.
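To make the local-minimum idea concrete, here is a small sketch in plain Python. The cost function `f` below is a hypothetical one-dimensional curve with two valleys, picked only for illustration; it is not the cost function from the session. The sketch shows gradient descent stopping in different valleys depending on where it starts, and then uses several random starting points and keeps the best result, which is one simple way of acting on the "initialize as randomly as possible" advice.

```python
import random


def f(w):
    # Hypothetical non-convex cost with a shallow valley and a deep valley.
    return w ** 4 - 3 * w ** 2 + w


def f_prime(w):
    # Derivative of f.
    return 4 * w ** 3 - 6 * w + 1


def descend(w, learning_rate=0.01, steps=2000):
    for _ in range(steps):
        w -= learning_rate * f_prime(w)
    return w


w_right = descend(2.0)    # ends in the shallow valley: a local minimum
w_left = descend(-2.0)    # ends in the deep valley: the global minimum

# Random restarts: try several random starting weights and keep the best one.
rng = random.Random(0)
starts = [rng.uniform(-2.0, 2.0) for _ in range(20)]
best = min((descend(s) for s in starts), key=f)

print("from  2.0:", round(w_right, 4), "cost", round(f(w_right), 4))
print("from -2.0:", round(w_left, 4), "cost", round(f(w_left), 4))
print("best of 20 random starts:", round(best, 4), "cost", round(f(best), 4))
```

In both valleys the gradient is zero, so the update stops in either one; only comparing the cost values tells you which is the lower point.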
I’ll show you an implementation in TensorFlow for one of the use cases that we have already implemented with logistic regression, using the same data set we had earlier. We are importing these libraries, and they are the same basic libraries; the only new one is the TensorFlow library, and the rest is the same. Then we are reading the data, the sonar CSV data which we had. We had values of y something like this, and now we have created one-hot encodings: 0 is represented as 1-0 and 1 as 0-1. Next we come to the normalization process. It was important in machine learning to normalize the features, either to zero mean and unit standard deviation, or to between 0 and 1 with min-max scaling, or with any other normalization, but you need to bring all your input features to one scale. This is one of the important steps that needs to be taken before you implement any deep learning algorithm. If your features are not normalized, your model may take a lot of time to learn and a lot of time to converge; when I say converge, I mean identify the most optimal weights. Here the normalization process is taking the value, subtracting its mean, and dividing by the standard deviation of the data. This normalizes the features, and we have done it for all the features in one go; we will call this function to do the task.
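The normalization step described above (subtract the mean, divide by the standard deviation, for every feature column) can be sketched as follows. This is an illustration in plain Python rather than the function used in the session; the name `normalize_columns`, the toy rows, and the choice of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def normalize_columns(rows):
    # Transpose so that each tuple holds all values of one feature.
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    # Population standard deviation; fall back to 1.0 for constant columns.
    stds = [pstdev(column) or 1.0 for column in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical features on very different scales.
rows = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
normalized = normalize_columns(rows)
for row in normalized:
    print([round(value, 4) for value in row])
```

After the transformation both columns have zero mean and unit standard deviation, so no feature dominates the weight updates just because of its units.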
We introduce a column, as I was saying, with all values equal to 1: a column of the same length as the other features, where every entry is 1. As I was explaining, the purpose is so that we can do the implementation pretty quickly, because now we can multiply w0 with x0, and w0, as I was saying earlier, acts as the bias term that we had as b. Because we are doing matrix multiplication, we can easily compute the summation of w_i x_i for i from 0 to n, where n is the number of features. The next part is for plotting the values we have in the data set, so that we can see how our two classes, 1 and 0, are intermingled with each other and whether they are linearly separable. That is one process you can use to see whether you can use logistic regression for a given problem or whether you need a more complex algorithm. Here you can see that the red dots and the crosses are intermingled and there is no single line which can separate them. So either we need a more complex algorithm or, for people who are very much inclined towards classical machine learning algorithms, you need to do some feature engineering so that the features become linearly separable. But here we are using a neural network, which will help us separate these two classes. That’s how your
x values look like. These are all the features we have. And if you look at the values of our y variable, these are all ones, and that’s the reason why we need to shuffle our data set. We do it because there may be a case like this one, where all the top examples have y equal to 1 and all the bottom examples have y equal to 0. Shuffling mixes the classes evenly so that we don’t have this kind of ordering. Then we divide our data into two groups, train and test, as we have already seen in regression and logistic regression: 80% for training and 20% for test. The random state is there so that you get the same examples every time. These are the shapes: 165 is the number of rows and 15 the number of columns in the training set, and this is your training y. It has two columns, because every y value is dummy coded: as I was saying, 0 as 1-0 and 1 as 0-1. First we initialize some of the hyperparameters. As I said, we have introduced just two; as we go through we’ll learn more hyperparameters, and we have introduced only two because only these two are required here. The first is the learning
rate. The learning rate, as we have understood by now, is the step size: the amount of change in your weight at each iteration depends on it. It was something like this: the weight at iteration t+1 equals the weight at iteration t minus alpha, or eta (both are names for the learning rate), multiplied by whatever gradient is coming. That’s how a weight gets updated, and your alpha, the learning rate, decides how quickly a weight changes. You might think we should always use the maximum value possible for the learning rate, so that the weights are learned as quickly as possible. That’s not bad thinking, but there is a reason we always strive for the optimal value of the learning rate. Let me take an example to show you. Let’s say here we have taken some specific value of the learning rate. Now suppose we start from here and you have taken a very big learning rate, increased to a very large number. Starting from here, in one iteration, rather than reaching here, you reach, let’s say, here. In one iteration you have changed the weight from 0.1 to 0.7, and the loss has also reduced drastically: earlier it was around 300, and now it is almost close to 0. But it is not the optimal value, right? There is some gradient here as well. And if the value of your learning rate is very high, in the next step you will reach here, because your learning rate is high. Then you take the gradient: okay, now you need to reduce the weight, and probably next time you reach here. This process keeps hovering around. You may reach the minimum point, and then it will stop, but there is also a possibility that you keep oscillating between the optimal value and non-optimal values, and you never move in the direction where your weights finally get optimized. This kind of problem is called
model divergence in the literature. Some of you may be thinking that it’s not a very common problem, but I can tell you from experience that this is one of the most common problems in making your own models work. There are different ways of deciding what the learning rate should be. As a rule of thumb, you will have seen people using something between 0.1 and 0.01 and variants of those: some people use 0.05, and sometimes it’s 0.6 or something. This is the actual range people use for the learning rate, and 0.1 to 0.001 is not a bad selection. We have already seen that with a very big learning rate we end up in the scenario where we never reach the minimum value of our loss function and keep on oscillating. We can see whether the learning rate is good or bad because, in the example, we will be storing the loss. I’ll also show you why we store the loss of your model at each iteration, or each epoch to be precise: so that you can see whether your model is really learning or just oscillating. The trend of your loss function helps you see whether it’s doing a good job or not. Okay, now I know that a bigger learning rate
may create this kind of problem. So why not choose a very small value? Let’s say I select this as the learning rate. This way I know I’ll be taking smaller steps and more time, but eventually I’ll reach the minimum value. But as I also said, in most real implementations you set how many times your model should run, and let’s say you have set a thousand. Why do we do that? Because when you start doing your computations on a GPU, you have to be really conscious of how much time your model takes to learn; if it takes too much time, you end up spending a lot of money, which is not worth it. And if you have chosen a very small learning rate, your model takes very tiny steps. It is going towards the minimum value of your function, but in very short steps, so it takes a lot of time to get there. Let’s say after a thousand steps your model has reached only here, and you end up using these weights. So these are the two reasons you have to identify the optimal learning rate, and that’s one of the most important hyperparameters for learning the model.
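The two failure modes described above, a learning rate that is too large and one that is too small, can be reproduced on a toy problem. The sketch below is an assumption-laden illustration in plain Python: the data, the single weight, the budget of 100 epochs, and the three learning rates are hypothetical and tuned to this toy loss, so the specific thresholds do not carry over to other models.

```python
def run_gradient_descent(learning_rate, epochs=100, w_start=0.1):
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [1.5, 3.0, 4.5, 6.0]  # hypothetical data generated with w = 1.5
    w = w_start
    cost_history = []
    for _ in range(epochs):
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys))
        w -= learning_rate * gradient
        # Store the cost after every update so the trend can be inspected.
        cost_history.append(sum((w * x - y) ** 2 for x, y in zip(xs, ys)))
    return w, cost_history


settings = [("too large", 0.04), ("reasonable", 0.01), ("too small", 0.0001)]
for name, rate in settings:
    w, history = run_gradient_descent(rate)
    print(f"{name:10s} lr={rate}: w={w:.4g}, "
          f"first cost={history[0]:.4g}, last cost={history[-1]:.4g}")
```

With the large rate every step overshoots the minimum and the cost grows; with the tiny rate the cost falls, but the weight is still far from 1.5 when the epoch budget runs out.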
Then this is the training epochs: the number of iterations you need to run your algorithm for. Here we are saying run it a thousand times. And here I am setting up an empty array, which will be helpful because, as I said, we will be measuring the cost at each iteration to see whether your model is learning well. How do we know it’s doing well? Your cost should be reducing at each iteration, or epoch, which shows that your model is learning the weights in the correct direction. About the number of epochs, I also want to mention one more thing, similar to the learning rate. If you set too many, say a very big number of training epochs, then your model takes too much time. And if you take too few, say 5 or 10, then you will not be able to reach the optimal value, because you have reached only, let’s say, this position, and this is not the optimum for your model. So you need to play around with these numbers so that you take optimal steps and reach the minimum value with the minimum number of epochs or iterations. Your cost function’s graph is very helpful here: it tells you how well you are doing in terms of both these values.
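Reading the cost history can also be done in code. The helper below is a rough heuristic written for illustration only: the name `diagnose`, the window of the last 10 values, and the thresholds are all assumptions, not a standard rule and not something used in the session. It looks at a list of per-epoch costs and reports whether the run is diverging, oscillating, flattened, or still decreasing.

```python
def diagnose(cost_history, tail=10, flat_tolerance=1e-3):
    if len(cost_history) < tail + 1:
        raise ValueError("need more cost values than the tail window")
    if cost_history[-1] > cost_history[0]:
        return "diverging"          # cost ended higher: learning rate too big
    recent = cost_history[-tail:]
    rises = sum(1 for a, b in zip(recent, recent[1:]) if b > a)
    if rises >= tail // 3:
        return "oscillating"        # cost keeps bouncing up and down
    relative_drop = (recent[0] - recent[-1]) / max(abs(recent[0]), 1e-12)
    if relative_drop < flat_tolerance:
        return "flattened"          # more epochs will not help much
    return "still decreasing"       # more epochs may help


# Synthetic cost curves, made up to exercise each branch.
print(diagnose([0.9 ** i for i in range(50)]))
print(diagnose([0.5 + 0.5 * 0.5 ** i for i in range(50)]))
print(diagnose([1.5 ** i for i in range(20)]))
print(diagnose([1.0 if i % 2 == 0 else 0.8 for i in range(20)]))
```

A plot of the same list gives the same information at a glance; the function only makes the reading rules explicit.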
As a rule of thumb, start with a reasonably good learning rate, which is 0.1, and start with something like one hundred or two hundred training epochs, not more than that. Only after looking at the graph of the cost function (I’ll show you how to interpret it) should you decide whether you need a bigger learning rate or more epochs; you can take a lot of help from the graph of the cost history. Now we are setting the dimensions of your input shape. You get the number of columns: in the shape, index 1 is for columns and index 0 was for rows. The number of classes is two, because our two classes, 0 and 1, are now represented as two categories, 0-1 and 1-0, since we have done one-hot encoding. You need to specify the shape of x, and that’s what I’m saying: it is very important that you specify the dimensions. We are creating X
as a placeholder, like we have already done for smaller examples. It is of type float32, which we use for decimal values. This None means that it can have any number of rows. In this particular data set we are training our model on, let’s say, 159 or 200 training examples, but tomorrow you may have 300, and you don’t have to change this dimension. If you fix it, you have to specify how many rows there are, but if you keep it None, the algorithm understands that it is a changeable number and accepts whatever number of rows it gets. With the number of dimensions, you are telling the model how many input features you have. We are deviating a little here, since earlier we specified a column for our biases, but we are doing the simple implementation as before. The more important point to take away is that we are initializing the weights with all zeros. Let me tell you, it is not a good implementation, but here we are initializing all the initial weights at zero. And of what dimension should they be? To do matrix multiplication, the two matrices have to have specific dimensions. Let’s say the first one is n by m: what should be the dimension of the second matrix? It must have m rows, and then I may have any number of columns. Only if the number of columns of the first matrix equals the number of rows of the second matrix can you do matrix multiplication. That’s why you have to make sure the weights you are initializing have the correct dimensions, and this is where most people make mistakes, by interchanging the numbers: here, this is the number of columns and this is the number of rows.
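The dimension rule above can be checked with a few lines of plain Python. This sketch does not use TensorFlow; the helper names `shape` and `matmul`, and the sizes `n_dim = 3` and `n_class = 2`, are hypothetical. It multiplies a batch of inputs by a zero-initialized weight matrix of shape `n_dim` by `n_class`, adds one bias per class, and then shows the error you get when the two weight dimensions are interchanged.

```python
def shape(matrix):
    return len(matrix), len(matrix[0])


def matmul(a, b):
    a_rows, a_cols = shape(a)
    b_rows, b_cols = shape(b)
    # Columns of the first matrix must equal rows of the second.
    if a_cols != b_rows:
        raise ValueError(
            f"cannot multiply {a_rows}x{a_cols} by {b_rows}x{b_cols}"
        )
    return [
        [sum(a[i][k] * b[k][j] for k in range(a_cols)) for j in range(b_cols)]
        for i in range(a_rows)
    ]


n_dim, n_class = 3, 2                                # hypothetical sizes
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]               # 2 rows, n_dim columns
w = [[0.0] * n_class for _ in range(n_dim)]          # n_dim x n_class zeros
b = [0.0] * n_class                                  # one bias per class

logits = [[value + bias for value, bias in zip(row, b)] for row in matmul(x, w)]
print("logits shape:", shape(logits))

w_wrong = [[0.0] * n_dim for _ in range(n_class)]    # dimensions interchanged
message = None
try:
    matmul(x, w_wrong)
except ValueError as error:
    message = str(error)
print("error:", message)
```

The number of rows in `x` never appears in the shape of `w`, which is why the row dimension of the input can be left open.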
If you make a mistake here, you will get a lot of errors. Then I’m initializing the bias term with all zeros, with size equal to the number of classes, and there are two classes, so one bias term for each class. We are just using all the variables in the code and defining the y variable, the output, here. So we are calling the softmax function. The predicted value is tf.matmul of X and W plus the bias term, passed through softmax. And this is the placeholder for your actual values. For the actual values we use the same kind of placeholder, float32, and it is None again because, as you see for the X variable, we are saying it can be any number of rows. You would agree that we need as many rows of the dependent variable as rows of input features, because they correspond one to one, and the number of columns is the number of classes, which is two. So we are doing two operations. First we calculate the total output, which I was showcasing in the Excel file: what is the total output for each class. Then I generate probabilities using the tf.nn.softmax function. After this calculation we implement the cross-entropy. There is a ready-made cross-entropy function, but here we want to show the steps. We multiply y with the log of the probability we get for each class, so you end up with a number only for the class where y equals 1. Then we take the negative and apply reduce_sum with reduction_indices; if you recall, 1 means by columns, so we sum it up across all the columns. Then we take the mean of all the values. So this is the total cost.
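The two operations described above, turning the per-class outputs into probabilities with softmax and then computing the cross-entropy as the mean over examples of the negative sum of y times log probability, can be sketched in plain Python. The function names and the two toy examples are assumptions for illustration; the session itself does this with TensorFlow operations.

```python
import math


def softmax(logits):
    # Subtract the maximum first for numerical stability.
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(label_rows, prob_rows):
    # Per example: -sum over classes of y * log(p). Only the class with
    # y = 1 contributes. Assumes every probability is greater than zero.
    per_example = [
        -sum(y * math.log(p) for y, p in zip(labels, probs))
        for labels, probs in zip(label_rows, prob_rows)
    ]
    # Mean over all examples gives the total cost.
    return sum(per_example) / len(per_example)


# Hypothetical outputs for two examples and two classes.
logits = [[2.0, 0.0], [0.0, 0.0]]
labels = [[1, 0], [0, 1]]          # one-hot encoded actual values

probs = [softmax(row) for row in logits]
cost = cross_entropy(labels, probs)
print("probabilities:", [[round(p, 4) for p in row] for row in probs])
print("cost:", round(cost, 4))
```

The second example has equal outputs for both classes, so its probabilities are 0.5 each and its contribution to the cost is the natural log of 2.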
The train optimizer: we could have done this, as in the earlier example, in one line of code. We are calling the gradient descent optimizer with the learning rate we have already initialized, 0.1 if I’m not wrong, and what I need to do is minimize the cost function I have defined here. Then I initialize a session and initialize all the variables in the model. Then mse_history: this is where we store the loss at each iteration, and we start with an empty list. Then it’s the same loop over the number of training epochs, which we have set to a thousand. All we are doing is running this code. You could have used i or anything as the loop variable, but because we call them epochs, we are using for epoch in range of a thousand, so it runs a thousand times. What do you need to run? The training step, which was the last step in our case. You all know by now that everything in TensorFlow is a graph, and if you run the last calculation, all the calculations above it are done accordingly. For the feed dictionary, we have already defined what train_x and train_y are. We can also calculate the cost with the same feed dictionary so that we can see what it looks like, and we keep appending this cost at each iteration so that we can see how the model is doing. In this line of code we are predicting: once training is done, we make the prediction for the test examples. When it runs, it should print three values, including the epoch number and the total cost, and it is actually most important to see whether your cost is going down or not. So let me go to the top. I have already run this code for you, with values that may be
of a different scale; sometimes you get different numbers. But if the cost is reducing at each iteration, that is probably an indication that your model is doing a good job: your weights are being learned in the right direction and your loss is reducing. We started with 0.68, and as I run it a thousand times it keeps reducing. There may be fluctuations, and not every step will reduce it, but if overall you see that it has gone down, that is an indication that your model is learning. We have created a plot of all the thousand iteration values. What it tells you is, first, that your cost is going down, maybe not in a perfectly optimized way, but pretty neatly. I can also see that the learning rate we selected is doing a good job of reducing the cost, and at the thousandth iteration it is still going down.
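The whole training loop (zero-initialized weights, softmax output, cross-entropy cost, gradient descent, and a cost history stored at every epoch) can be sketched end to end in plain Python. This is not the TensorFlow code from the session: the six-point toy data set, the function names, and the choice of 200 epochs are assumptions, and the gradient is written out by hand instead of being derived by an optimizer.

```python
import math


def softmax(logits):
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(features, weights, biases):
    # weights has shape n_dim x n_class; biases has one entry per class.
    logits = [
        sum(x * weights[d][c] for d, x in enumerate(features)) + biases[c]
        for c in range(len(biases))
    ]
    return softmax(logits)


def mean_cross_entropy(xs, ys, weights, biases):
    total = 0.0
    for features, label in zip(xs, ys):
        probs = predict_proba(features, weights, biases)
        total += -sum(y * math.log(p) for y, p in zip(label, probs))
    return total / len(xs)


def train(xs, ys, learning_rate=0.1, training_epochs=200):
    n_dim, n_class = len(xs[0]), len(ys[0])
    weights = [[0.0] * n_class for _ in range(n_dim)]
    biases = [0.0] * n_class
    cost_history = []
    for _ in range(training_epochs):
        grad_w = [[0.0] * n_class for _ in range(n_dim)]
        grad_b = [0.0] * n_class
        for features, label in zip(xs, ys):
            probs = predict_proba(features, weights, biases)
            for c in range(n_class):
                # Gradient of softmax plus cross-entropy: probability - label.
                error = (probs[c] - label[c]) / len(xs)
                grad_b[c] += error
                for d in range(n_dim):
                    grad_w[d][c] += error * features[d]
        for d in range(n_dim):
            for c in range(n_class):
                weights[d][c] -= learning_rate * grad_w[d][c]
        for c in range(n_class):
            biases[c] -= learning_rate * grad_b[c]
        cost_history.append(mean_cross_entropy(xs, ys, weights, biases))
    return weights, biases, cost_history


def predict_class(features, weights, biases):
    probs = predict_proba(features, weights, biases)
    return probs.index(max(probs))


# Hypothetical, already normalized data: two well separated classes.
xs = [[-1.0, -0.8], [-0.9, -1.2], [-1.1, -1.0],
      [1.0, 0.9], [0.8, 1.1], [1.2, 1.0]]
ys = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]

weights, biases, cost_history = train(xs, ys)
correct = sum(predict_class(x, weights, biases) == y.index(1)
              for x, y in zip(xs, ys))
accuracy = correct / len(xs)
for epoch in (0, 49, 99, 199):
    print("epoch:", epoch, "cost:", round(cost_history[epoch], 4))
print("training accuracy:", accuracy)
```

Plotting `cost_history` against the epoch number gives the kind of curve discussed here: if it is still sloping downward at the last epoch, more epochs are likely to help.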
If your graph had looked something like this, it would be an indication that your model doesn’t require a lot of iterations, because your loss function has tapered off and it’s not learning any further. If your loss function looks like this against the number of epochs, it’s an indication that your cost is still going down. eval is again an evaluation function in TensorFlow, and here again we are plotting the same cost. In my opinion it has not completely flattened: it is still reducing a little, not much, but I see there is scope to work with a larger number of epochs. One thing you can try on your own when you run this code is to increase the learning rate and reduce the number of epochs, and see. The best shape of your graph looks something like this. Thank you.

Thank you for the great session, Amit. I hope all of you found it informative. If you have any further queries related to the session, please comment in the comment section below. Until then, that’s all from our side today. Thank you and happy learning. I hope you have enjoyed listening to this video. Please be kind enough to like it, and you can comment any of your doubts and queries and we will reply to them at the earliest. Do look out for more videos in our playlist and subscribe to the Edureka channel to learn more. Happy learning.
